[PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

This patchset replaces the current barrier implementation with
sequenced flush, which doesn't impose any restriction on ordering
around the flush requests. This patchset is the result of the following
discussion thread.

http://thread.gmane.org/gmane.linux.file-systems/43877

In summary, filesystems can take over the ordering of requests around
commit writes and the block layer should just supply a mechanism to
perform the commit writes themselves. This would greatly lessen the
stalls caused by the queue dumping and draining that the current barrier
implementation uses for request ordering.

This patchset converts the barrier mechanism to a sequenced flush/FUA
mechanism in the following steps.

1. Kill the mostly unused ORDERED_BY_TAG support.

2. Deprecate REQ_HARDBARRIER support. All hard barrier requests are
failed with -EOPNOTSUPP.

3. Drop barrier ordering by queue draining mechanism.

4. Rename barrier to flush and implement new interface based on
REQ_FLUSH and REQ_FUA as suggested by Christoph.

blkdev_issue_flush() is converted to use the new mechanism, but all the
filesystems still use the deprecated REQ_HARDBARRIER, which now always
fails. Each filesystem needs to be updated to enforce request
ordering itself and then to use the REQ_FLUSH/FUA mechanism.
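
To make this division of labour concrete, here is a minimal userspace
sketch in C. It is not kernel code. It models a device with a volatile
write cache and shows why a commit write issued with both flush and FUA
semantics is durable once the filesystem has waited for its earlier
writes. All names (toy_dev, TOY_FLUSH, TOY_FUA and the helpers) are
hypothetical and exist only for this illustration; the model is
synchronous, so "waiting for completion" is implicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag values for this model only; not the kernel's REQ_* bits. */
#define TOY_FLUSH 0x1u	/* flush the volatile cache before this write */
#define TOY_FUA   0x2u	/* this write must itself reach stable media */

#define TOY_BLOCKS 8

struct toy_dev {
	bool cached[TOY_BLOCKS];	/* block sits only in the volatile cache */
	bool stable[TOY_BLOCKS];	/* block is on non-volatile media */
};

/* Move everything in the volatile cache to stable media. */
static void toy_flush(struct toy_dev *d)
{
	for (int i = 0; i < TOY_BLOCKS; i++) {
		if (d->cached[i]) {
			d->stable[i] = true;
			d->cached[i] = false;
		}
	}
}

/* Write one block, honouring the toy FLUSH/FUA flags. */
static void toy_write(struct toy_dev *d, int blk, unsigned int flags)
{
	if (flags & TOY_FLUSH)
		toy_flush(d);		/* earlier writes become durable first */
	if (flags & TOY_FUA)
		d->stable[blk] = true;	/* bypass the cache for this block */
	else
		d->cached[blk] = true;
}

/* Power loss: whatever is still only in the volatile cache is gone. */
static void toy_power_loss(struct toy_dev *d)
{
	memset(d->cached, 0, sizeof(d->cached));
}

/*
 * Journal-style commit: two data blocks are written, the caller waits
 * for them (implicit here), then the commit block is written with the
 * given flags. Returns true if data and commit all survive power loss.
 */
static bool toy_commit_is_durable(unsigned int commit_flags)
{
	struct toy_dev d;

	memset(&d, 0, sizeof(d));
	toy_write(&d, 0, 0);		/* data block */
	toy_write(&d, 1, 0);		/* data block */
	toy_write(&d, 7, commit_flags);	/* commit block */
	toy_power_loss(&d);

	return d.stable[0] && d.stable[1] && d.stable[7];
}
```

In this model only the commit issued with TOY_FLUSH | TOY_FUA survives
intact; dropping the flush loses the data blocks and dropping FUA loses
the commit block itself.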

loop, md, dm, etc... haven't been converted yet and REQ_FLUSH/FUA
doesn't work with them yet. I'll convert most of them soonish if this
patchset is generally agreed upon.

This patchset contains the following patches.

0001-block-loop-queue-ordered-mode-should-be-DRAIN_FLUSH.patch
0002-block-kill-QUEUE_ORDERED_BY_TAG.patch
0003-block-deprecate-barrier-and-replace-blk_queue_ordere.patch
0004-block-remove-spurious-uses-of-REQ_HARDBARRIER.patch
0005-block-misc-cleanups-in-barrier-code.patch
0006-block-drop-barrier-ordering-by-queue-draining.patch
0007-block-rename-blk-barrier.c-to-blk-flush.c.patch
0008-block-rename-barrier-ordered-to-flush.patch
0009-block-implement-REQ_FLUSH-FUA-based-interface-for-FL.patch
0010-fs-block-propagate-REQ_FLUSH-FUA-interface-to-upper-.patch
0011-block-use-REQ_FLUSH-in-blkdev_issue_flush.patch

and is also available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contains the following changes.

block/Makefile | 2
block/blk-barrier.c | 350 ------------------------------------
block/blk-core.c | 55 ++---
block/blk-flush.c | 248 +++++++++++++++++++++++++
block/blk-settings.c | 20 ++
block/blk.h | 8
block/elevator.c | 79 --------
drivers/block/brd.c | 1
drivers/block/loop.c | 2
drivers/block/osdblk.c | 5
drivers/block/pktcdvd.c | 1
drivers/block/ps3disk.c | 2
drivers/block/virtio_blk.c | 34 ---
drivers/block/xen-blkfront.c | 47 +---
drivers/ide/ide-disk.c | 13 -
drivers/md/dm.c | 2
drivers/mmc/card/queue.c | 1
drivers/s390/block/dasd.c | 1
drivers/scsi/aic7xxx_old.c | 21 --
drivers/scsi/libsas/sas_scsi_host.c | 13 -
drivers/scsi/sd.c | 18 -
fs/buffer.c | 27 +-
include/linux/blk_types.h | 4
include/linux/blkdev.h | 73 +------
include/linux/buffer_head.h | 8
include/linux/fs.h | 20 +-
include/scsi/scsi_tcq.h | 6
27 files changed, 402 insertions(+), 659 deletions(-)

Thanks.

--
tejun
Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045914 ]

[PATCH 04/11] block: remove spurious uses of REQ_HARDBARRIER

REQ_HARDBARRIER is deprecated. Remove spurious uses in the following
users. Please note that other than osdblk, all other uses were
already spurious before deprecation.

* osdblk: osdblk_rq_fn() won't receive any request with
REQ_HARDBARRIER set. Remove the test for it.

* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
anything. Removed.

* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
spurious. Removed.

* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
spurious. Removed.

* scsi_tcq: The ordered tag path wasn't being used anyway. Removed.

Signed-off-by: Tejun Heo <tj [at] kernel.org>
Cc: Boaz Harrosh <bharrosh [at] panasas.com>
Cc: James Bottomley <James.Bottomley [at] suse.de>
Cc: Peter Osterlund <petero2 [at] telia.com>
---
drivers/block/osdblk.c | 3 +--
drivers/block/pktcdvd.c | 1 -
drivers/scsi/aic7xxx_old.c | 21 ++-------------------
drivers/scsi/libsas/sas_scsi_host.c | 13 +------------
include/scsi/scsi_tcq.h | 6 +-----
5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
break;

/* filter out block requests we don't understand */
- if (rq->cmd_type != REQ_TYPE_FS &&
- !(rq->cmd_flags & REQ_HARDBARRIER)) {
+ if (rq->cmd_type != REQ_TYPE_FS) {
blk_end_request_all(rq, 0);
continue;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *

rq->timeout = 60*HZ;
rq->cmd_type = REQ_TYPE_BLOCK_PC;
- rq->cmd_flags |= REQ_HARDBARRIER;
if (cgc->quiet)
rq->cmd_flags |= REQ_QUIET;

diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
index 93984c9..e1cd606 100644
--- a/drivers/scsi/aic7xxx_old.c
+++ b/drivers/scsi/aic7xxx_old.c
@@ -2850,12 +2850,6 @@ aic7xxx_done(struct aic7xxx_host *p, struct aic7xxx_scb *scb)
aic_dev->r_total++;
ptr = aic_dev->r_bins;
}
- if(cmd->device->simple_tags && cmd->request->cmd_flags & REQ_HARDBARRIER)
- {
- aic_dev->barrier_total++;
- if(scb->tag_action == MSG_ORDERED_Q_TAG)
- aic_dev->ordered_total++;
- }
x = scb->sg_length;
x >>= 10;
for(i=0; i<6; i++)
@@ -10144,19 +10138,8 @@ static void aic7xxx_buildscb(struct aic7xxx_host *p, struct scsi_cmnd *cmd,
/* We always force TEST_UNIT_READY to untagged */
if (cmd->cmnd[0] != TEST_UNIT_READY && sdptr->simple_tags)
{
- if (req->cmd_flags & REQ_HARDBARRIER)
- {
- if(sdptr->ordered_tags)
- {
- hscb->control |= MSG_ORDERED_Q_TAG;
- scb->tag_action = MSG_ORDERED_Q_TAG;
- }
- }
- else
- {
- hscb->control |= MSG_SIMPLE_Q_TAG;
- scb->tag_action = MSG_SIMPLE_Q_TAG;
- }
+ hscb->control |= MSG_SIMPLE_Q_TAG;
+ scb->tag_action = MSG_SIMPLE_Q_TAG;
}
}
if ( !(aic_dev->dtr_pending) &&
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f0cfba9..535085c 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -130,17 +130,6 @@ static void sas_scsi_task_done(struct sas_task *task)
sc->scsi_done(sc);
}

-static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
-{
- enum task_attribute ta = TASK_ATTR_SIMPLE;
- if (cmd->request && blk_rq_tagged(cmd->request)) {
- if (cmd->device->ordered_tags &&
- (cmd->request->cmd_flags & REQ_HARDBARRIER))
- ta = TASK_ATTR_ORDERED;
- }
- return ta;
-}
-
static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
struct domain_device *dev,
gfp_t gfp_flags)
@@ -160,7 +149,7 @@ static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
task->ssp_task.retry_count = 1;
int_to_scsilun(cmd->device->lun, &lun);
memcpy(task->ssp_task.LUN, &lun.scsi_lun, 8);
- task->ssp_task.task_attr = sas_scsi_get_task_attr(cmd);
+ task->ssp_task.task_attr = TASK_ATTR_SIMPLE;
memcpy(task->ssp_task.cdb, cmd->cmnd, 16);

task->scatter = scsi_sglist(cmd);
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 1723138..d6e7994 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -97,13 +97,9 @@ static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
{
struct request *req = cmd->request;
- struct scsi_device *sdev = cmd->device;

if (blk_rq_tagged(req)) {
- if (sdev->ordered_tags && req->cmd_flags & REQ_HARDBARRIER)
- *msg++ = MSG_ORDERED_TAG;
- else
- *msg++ = MSG_SIMPLE_TAG;
+ *msg++ = MSG_SIMPLE_TAG;
*msg++ = req->tag;
return 2;
}
--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045915 ]

[PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().

blk_queue_flush() takes a combination of REQ_FLUSH and REQ_FUA. If a
device has a write cache and can flush it, it should set REQ_FLUSH. If
the device can handle FUA writes, it should also set REQ_FUA.

All blk_queue_ordered() users are converted.

* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
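
The mapping above, together with the sanity rule that FUA is only
meaningful when the device can also flush, can be sketched in plain
userspace C as follows. This is an illustration, not the kernel code:
the MAP_* names and bit values are stand-ins invented for the sketch,
and the mode called ORDERED_DRAIN_FLUSH_FUA above is assumed to be the
one spelled QUEUE_ORDERED_DRAIN_FUA in the diff.

```c
#include <assert.h>

/* Stand-in capability bits for this sketch; not the kernel's values. */
#define MAP_REQ_FLUSH 0x1u
#define MAP_REQ_FUA   0x2u

/* Stand-ins for the old QUEUE_ORDERED_* modes named in the list above. */
enum map_ordered {
	MAP_ORDERED_NONE,
	MAP_ORDERED_DRAIN,
	MAP_ORDERED_DRAIN_FLUSH,
	MAP_ORDERED_DRAIN_FUA,
};

/* Translate an old ordered mode into the new flush capability mask. */
static unsigned int map_ordered_to_flush(enum map_ordered mode)
{
	switch (mode) {
	case MAP_ORDERED_DRAIN_FLUSH:
		return MAP_REQ_FLUSH;
	case MAP_ORDERED_DRAIN_FUA:
		return MAP_REQ_FLUSH | MAP_REQ_FUA;
	default:
		/* NONE and DRAIN: no cache to flush, which is the default */
		return 0;
	}
}

/*
 * Mirror of the normalization described for blk_queue_flush(): unknown
 * bits are dropped, and FUA without FLUSH is dropped as well.
 */
static unsigned int map_normalize_flush(unsigned int flush)
{
	flush &= MAP_REQ_FLUSH | MAP_REQ_FUA;
	if (!(flush & MAP_REQ_FLUSH))
		flush &= ~MAP_REQ_FUA;
	return flush;
}
```

A driver that previously passed a DRAIN-only mode therefore needs no
call at all, which is why several blk_queue_ordered() calls in the diff
below are simply deleted.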

Signed-off-by: Tejun Heo <tj [at] kernel.org>
Cc: Christoph Hellwig <hch [at] infradead.org>
Cc: Nick Piggin <npiggin [at] kernel.dk>
Cc: Michael S. Tsirkin <mst [at] redhat.com>
Cc: Jeremy Fitzhardinge <jeremy [at] xensource.com>
Cc: Chris Wright <chrisw [at] sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori [at] lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh [at] panasas.com>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven [at] sonycom.com>
Cc: David S. Miller <davem [at] davemloft.net>
Cc: Alasdair G Kergon <agk [at] redhat.com>
Cc: Pierre Ossman <drzeus [at] drzeus.cx>
Cc: Stefan Weinhuber <wein [at] de.ibm.com>
---
block/blk-barrier.c | 29 ----------------------------
block/blk-core.c | 6 +++-
block/blk-settings.c | 20 +++++++++++++++++++
drivers/block/brd.c | 1 -
drivers/block/loop.c | 2 +-
drivers/block/osdblk.c | 2 +-
drivers/block/ps3disk.c | 2 +-
drivers/block/virtio_blk.c | 25 ++++++++---------------
drivers/block/xen-blkfront.c | 43 +++++++++++------------------------------
drivers/ide/ide-disk.c | 13 +++++------
drivers/md/dm.c | 2 +-
drivers/mmc/card/queue.c | 1 -
drivers/s390/block/dasd.c | 1 -
drivers/scsi/sd.c | 16 +++++++-------
include/linux/blkdev.h | 6 +++-
15 files changed, 67 insertions(+), 102 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index c807e9c..ed0aba5 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,35 +9,6 @@

#include "blk.h"

-/**
- * blk_queue_ordered - does this queue support ordered writes
- * @q: the request queue
- * @ordered: one of QUEUE_ORDERED_*
- *
- * Description:
- * For journalled file systems, doing ordered writes on a commit
- * block instead of explicitly doing wait_on_buffer (which is bad
- * for performance) can be a big win. Block drivers supporting this
- * feature should call this function and indicate so.
- *
- **/
-int blk_queue_ordered(struct request_queue *q, unsigned ordered)
-{
- if (ordered != QUEUE_ORDERED_NONE &&
- ordered != QUEUE_ORDERED_DRAIN &&
- ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA) {
- printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
- return -EINVAL;
- }
-
- q->ordered = ordered;
- q->next_ordered = ordered;
-
- return 0;
-}
-EXPORT_SYMBOL(blk_queue_ordered);
-
/*
* Cache flushing for ordered writes handling
*/
diff --git a/block/blk-core.c b/block/blk-core.c
index 5ab3ac2..3f802dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
int rw_flags;

- if ((bio->bi_rw & REQ_HARDBARRIER) &&
- (q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /* REQ_HARDBARRIER is no more */
+ if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
+ "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a234f4b..9b18afc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);

+/**
+ * blk_queue_flush - configure queue's cache flush capability
+ * @q: the request queue for the device
+ * @flush: 0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
+ *
+ * Tell block layer cache flush capability of @q. If it supports
+ * flushing, REQ_FLUSH should be set. If it supports bypassing
+ * write cache for individual writes, REQ_FUA should be set.
+ */
+void blk_queue_flush(struct request_queue *q, unsigned int flush)
+{
+ WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
+
+ if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
+ flush &= ~REQ_FUA;
+
+ q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
+}
+EXPORT_SYMBOL_GPL(blk_queue_flush);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 47a4127..fa33f97 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a4a2e..953d1e1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(lo->lo_queue, REQ_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 2284b4f..72d6246 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
blk_queue_stack_limits(q, osd_request_queue(osdev->osd));

blk_queue_prep_rq(q, blk_queue_start_tag);
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(q, REQ_FLUSH);

disk->queue = q;

diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index e9da874..4911f9e 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
blk_queue_dma_alignment(queue, dev->blk_size-1);
blk_queue_logical_block_size(queue, dev->blk_size);

- blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(queue, REQ_FLUSH);

blk_queue_max_segments(queue, -1);
blk_queue_max_segment_size(queue, dev->bounce_size);
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7965280..d10b635 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that
- * to implement write barrier support.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else {
- /*
- * If the FLUSH feature is not supported we must assume that
- * the host does not perform any kind of volatile write
- * caching. We still need to drain the queue to provider
- * proper barrier semantics.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
- }
+ /*
+ * If the FLUSH feature is supported we do have support for
+ * flushing a volatile write cache on the host. Use that to
+ * implement write barrier support; otherwise, we must assume
+ * that the host does not perform any kind of volatile write
+ * caching.
+ */
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+ blk_queue_flush(q, REQ_FLUSH);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 25ffbf9..1d48f3a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -95,7 +95,7 @@ struct blkfront_info
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
unsigned long shadow_free;
- int feature_barrier;
+ unsigned int feature_flush;
int is_ready;
};

@@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
}


-static int xlvbd_barrier(struct blkfront_info *info)
+static void xlvbd_flush(struct blkfront_info *info)
{
- int err;
- const char *barrier;
-
- switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
- case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
- default: return -EINVAL;
- }
-
- err = blk_queue_ordered(info->rq, info->feature_barrier);
-
- if (err)
- return err;
-
+ blk_queue_flush(info->rq, info->feature_flush);
printk(KERN_INFO "blkfront: %s: barriers %s\n",
- info->gd->disk_name, barrier);
- return 0;
+ info->gd->disk_name,
+ info->feature_flush ? "enabled" : "disabled");
}


@@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
info->rq = gd->queue;
info->gd = gd;

- xlvbd_barrier(info);
+ xlvbd_flush(info);

if (vdisk_info & VDISK_READONLY)
set_disk_ro(gd, 1);
@@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
info->gd->disk_name);
error = -EOPNOTSUPP;
- info->feature_barrier = QUEUE_ORDERED_NONE;
- xlvbd_barrier(info);
+ info->feature_flush = 0;
+ xlvbd_flush(info);
}
/* fall through */
case BLKIF_OP_READ:
@@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
/*
* If there's no "feature-barrier" defined, then it means
* we're dealing with a very old backend which writes
- * synchronously; draining will do what needs to get done.
+ * synchronously; nothing to do.
*
* If there are barriers, then we use flush.
- *
- * If barriers are not supported, then there's no much we can
- * do, so just set ordering to NONE.
*/
- if (err)
- info->feature_barrier = QUEUE_ORDERED_DRAIN;
- else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
- else
- info->feature_barrier = QUEUE_ORDERED_NONE;
+ info->feature_flush = 0;
+ if (!err && barrier)
+ info->feature_flush = REQ_FLUSH;

err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
if (err) {
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 7433e07..7c5b01c 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
return ide_no_data_taskfile(drive, &cmd);
}

-static void update_ordered(ide_drive_t *drive)
+static void update_flush(ide_drive_t *drive)
{
u16 *id = drive->id;
- unsigned ordered = QUEUE_ORDERED_NONE;
+ unsigned flush = 0;

if (drive->dev_flags & IDE_DFLAG_WCACHE) {
unsigned long long capacity;
@@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
drive->name, barrier ? "" : "not ");

if (barrier) {
- ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+ flush = REQ_FLUSH;
blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
}
- } else
- ordered = QUEUE_ORDERED_DRAIN;
+ }

- blk_queue_ordered(drive->queue, ordered);
+ blk_queue_flush(drive->queue, flush);
}

ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
@@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
}
}

- update_ordered(drive);
+ update_flush(drive);

return err;
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3f21dc..b71cc9e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(int minor)
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(md->queue, REQ_FLUSH);

md->disk = alloc_disk(1);
if (!md->disk)
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index c77eb49..d791772 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
mq->req = NULL;

blk_queue_prep_rq(mq->queue, mmc_prep_request);
- blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);

#ifdef CONFIG_MMC_BLOCK_BOUNCE
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 1a84fae..29046b7 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
*/
blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
- blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
}

/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 05a15b0..7f6aca2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
struct scsi_disk *sdkp = scsi_disk(disk);
struct scsi_device *sdp = sdkp->device;
unsigned char *buffer;
- unsigned ordered;
+ unsigned flush = 0;

SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
"sd_revalidate_disk\n"));
@@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests.
+ * with flush requests.
*/
- if (sdkp->WCE)
- ordered = sdkp->DPOFUA
- ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
- else
- ordered = QUEUE_ORDERED_DRAIN;
+ if (sdkp->WCE) {
+ flush |= REQ_FLUSH;
+ if (sdkp->DPOFUA)
+ flush |= REQ_FUA;
+ }

- blk_queue_ordered(sdkp->disk->queue, ordered);
+ blk_queue_flush(sdkp->disk->queue, flush);

set_capacity(disk, sdkp->capacity);
kfree(buffer);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 96ef5f1..6003f7c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -355,8 +355,10 @@ struct request_queue
struct blk_trace *blk_trace;
#endif
/*
- * reserved for flush operations
+ * for flush operations
*/
+ unsigned int flush_flags;
+
unsigned int ordered, next_ordered, ordseq;
int orderr, ordcolor;
struct request pre_flush_rq, bar_rq, post_flush_rq;
@@ -863,8 +865,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
+extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern int blk_queue_ordered(struct request_queue *, unsigned);
extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
--
1.7.1

Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045916 ]

[PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.

* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.

Signed-off-by: Tejun Heo <tj [at] kernel.org>
Cc: Christoph Hellwig <hch [at] infradead.org>
Cc: Nick Piggin <npiggin [at] kernel.dk>
Cc: Michael S. Tsirkin <mst [at] redhat.com>
Cc: Jeremy Fitzhardinge <jeremy [at] xensource.com>
Cc: Chris Wright <chrisw [at] sous-sol.org>
---
block/blk-barrier.c | 35 +++++++----------------------------
drivers/block/brd.c | 2 +-
drivers/block/virtio_blk.c | 9 ---------
drivers/block/xen-blkfront.c | 8 +++-----
drivers/scsi/sd.c | 4 +---
include/linux/blkdev.h | 17 +----------------
6 files changed, 13 insertions(+), 62 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
if (ordered != QUEUE_ORDERED_NONE &&
ordered != QUEUE_ORDERED_DRAIN &&
ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA &&
- ordered != QUEUE_ORDERED_TAG &&
- ordered != QUEUE_ORDERED_TAG_FLUSH &&
- ordered != QUEUE_ORDERED_TAG_FUA) {
+ ordered != QUEUE_ORDERED_DRAIN_FUA) {
printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
return -EINVAL;
}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
* For an empty barrier, there's no actual BAR request, which
* in turn makes POSTFLUSH unnecessary. Mask them off.
*/
- if (!blk_rq_sectors(rq)) {
+ if (!blk_rq_sectors(rq))
q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
QUEUE_ORDERED_DO_POSTFLUSH);
- /*
- * Empty barrier on a write-through device w/ ordered
- * tag has no command to issue and without any command
- * to issue, ordering by tag can't be used. Drain
- * instead.
- */
- if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
- !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
- q->ordered &= ~QUEUE_ORDERED_BY_TAG;
- q->ordered |= QUEUE_ORDERED_BY_DRAIN;
- }
- }

/* stash away the original request */
blk_dequeue_request(rq);
@@ -210,7 +195,7 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
} else
skip |= QUEUE_ORDSEQ_PREFLUSH;

- if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+ if (queue_in_flight(q))
rq = NULL;
else
skip |= QUEUE_ORDSEQ_DRAIN;
@@ -257,16 +242,10 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
return true;

- if (q->ordered & QUEUE_ORDERED_BY_TAG) {
- /* Ordered by tag. Blocking the next barrier is enough. */
- if (is_barrier && rq != &q->bar_rq)
- *rqp = NULL;
- } else {
- /* Ordered by draining. Wait for turn. */
- WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
- if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
- }
+ /* Ordered by draining. Wait for turn. */
+ WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
+ if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
+ *rqp = NULL;

return true;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1c7f637..47a4127 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,7 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG);
+ blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2aafafc..7965280 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -395,15 +395,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
* to implement write barrier support.
*/
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
- /*
- * If the BARRIER feature is supported the host expects us
- * to order request by tags. This implies there is not
- * volatile write cache on the host, and that the host
- * never re-orders outstanding I/O. This feature is not
- * useful for real life scenarious and deprecated.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_TAG);
} else {
/*
* If the FLUSH feature is not supported we must assume that
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 510ab86..25ffbf9 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -424,8 +424,7 @@ static int xlvbd_barrier(struct blkfront_info *info)
const char *barrier;

switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break;
- case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break;
+ case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
default: return -EINVAL;
}
@@ -1078,8 +1077,7 @@ static void blkfront_connect(struct blkfront_info *info)
* we're dealing with a very old backend which writes
* synchronously; draining will do what needs to get done.
*
- * If there are barriers, then we can do full queued writes
- * with tagged barriers.
+ * If there are barriers, then we use flush.
*
* If barriers are not supported, then there's no much we can
* do, so just set ordering to NONE.
@@ -1087,7 +1085,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err)
info->feature_barrier = QUEUE_ORDERED_DRAIN;
else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_TAG;
+ info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
else
info->feature_barrier = QUEUE_ORDERED_NONE;

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8e2e893..05a15b0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2151,9 +2151,7 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests. Note that as the current SCSI
- * dispatch function can alter request order, we cannot use
- * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+ * with ordered requests.
*/
if (sdkp->WCE)
ordered = sdkp->DPOFUA
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89c855c..96ef5f1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -469,12 +469,7 @@ enum {
* DRAIN : ordering by draining is enough
* DRAIN_FLUSH : ordering by draining w/ pre and post flushes
* DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- * TAG : ordering by tag is enough
- * TAG_FLUSH : ordering by tag w/ pre and post flushes
- * TAG_FUA : ordering by tag w/ pre flush and FUA write
*/
- QUEUE_ORDERED_BY_DRAIN = 0x01,
- QUEUE_ORDERED_BY_TAG = 0x02,
QUEUE_ORDERED_DO_PREFLUSH = 0x10,
QUEUE_ORDERED_DO_BAR = 0x20,
QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
@@ -482,8 +477,7 @@ enum {

QUEUE_ORDERED_NONE = 0x00,

- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_BY_DRAIN |
- QUEUE_ORDERED_DO_BAR,
+ QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_POSTFLUSH,
@@ -491,15 +485,6 @@ enum {
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_FUA,

- QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG |
- QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
/*
* Ordered operation sequence
*/
--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045917 ]

[PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests

Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

Block layer will choose the correct way of implementing the semantics
and execute it. The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests. A device will never see a REQ_FLUSH and/or REQ_FUA
flag which it doesn't support.
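
The selection described above can be sketched in plain userspace C. This is only an illustration modelled on the blk_do_flush() hunk in this patch; the sk_-prefixed names and the flag values are made up for the example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for REQ_FLUSH / REQ_FUA (values are arbitrary). */
enum { SK_REQ_FLUSH = 1 << 0, SK_REQ_FUA = 1 << 1 };
#define SK_FLAG_MASK ((unsigned)(SK_REQ_FLUSH | SK_REQ_FUA))

/* Steps a sequenced flush may consist of. */
enum {
	SK_STEP_PREFLUSH  = 1 << 0,
	SK_STEP_DATA      = 1 << 1,
	SK_STEP_POSTFLUSH = 1 << 2,
};

/*
 * dev_flags: what the device advertises, rq_flags: what the submitter
 * asked for, sectors: payload size (0 for an empty flush).
 * Returns the mask of steps that have to be issued.
 */
unsigned sk_plan_steps(unsigned dev_flags, unsigned rq_flags, unsigned sectors)
{
	bool has_flush, has_fua, do_preflush, do_postflush;
	unsigned steps = 0;

	assert((dev_flags & ~SK_FLAG_MASK) == 0);

	has_flush = dev_flags & SK_REQ_FLUSH;
	has_fua = dev_flags & SK_REQ_FUA;
	do_preflush = has_flush && (rq_flags & SK_REQ_FLUSH);
	/* FUA is emulated by a post-flush only when the device lacks it */
	do_postflush = has_flush && !has_fua && (rq_flags & SK_REQ_FUA);

	if (do_preflush)
		steps |= SK_STEP_PREFLUSH;
	if (sectors)
		steps |= SK_STEP_DATA;
	if (do_postflush)
		steps |= SK_STEP_POSTFLUSH;
	return steps;
}

/* Flags left on the data request that is actually shown to the device. */
unsigned sk_flags_for_device(unsigned dev_flags, unsigned rq_flags)
{
	unsigned out = rq_flags & ~(unsigned)SK_REQ_FLUSH;

	if (!(dev_flags & SK_REQ_FUA))
		out &= ~(unsigned)SK_REQ_FUA;
	return out;
}
```

With a device that only advertises a flush capability, a FLUSH+FUA write expands to pre-flush, data and post-flush; with native FUA the post-flush is dropped and the FUA bit stays on the data request.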

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.

* REQ_FLUSH w/o data could also be passed directly to drivers without
sequencing, but some drivers assume that zero length requests don't
have rq->bio, which isn't true for these requests, so proxy requests
are still required for them.
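
The QUEUE_FSEQ_* bits moved by this patch form a small progress bitmask. The helpers that consume it are not part of this diff, so the following userspace sketch is an assumption about how such a tracker can work rather than a copy of the kernel code; all sk_ names are hypothetical.

```c
#include <assert.h>

/* Same layout as the QUEUE_FSEQ_* enum in the diff, renamed for the sketch. */
enum {
	SK_FSEQ_STARTED   = 1 << 0,
	SK_FSEQ_PREFLUSH  = 1 << 1,
	SK_FSEQ_DATA      = 1 << 2,
	SK_FSEQ_POSTFLUSH = 1 << 3,
	SK_FSEQ_DONE      = 1 << 4,
};

struct sk_queue {
	unsigned flush_seq;	/* bits of the steps already finished */
};

/* Current step: lowest bit not yet set, or 0 if no sequence is running. */
unsigned sk_cur_seq(const struct sk_queue *q)
{
	unsigned bit = 1;

	if (!q->flush_seq)
		return 0;
	while (q->flush_seq & bit)
		bit <<= 1;
	return bit;
}

/* Begin a sequence; only one may be in flight per queue. */
void sk_start_seq(struct sk_queue *q)
{
	assert(q->flush_seq == 0);
	q->flush_seq = SK_FSEQ_STARTED;
}

/*
 * Mark steps as finished (or skipped) and return the step to run next.
 * When the sequence reaches DONE the tracker is reset for the next one.
 */
unsigned sk_complete_seq(struct sk_queue *q, unsigned steps)
{
	unsigned next;

	q->flush_seq |= steps;
	next = sk_cur_seq(q);
	if (next == SK_FSEQ_DONE)
		q->flush_seq = 0;
	return next;
}
```

Skipping a step is the same operation as completing it, which is why blk_do_flush() in the diff can pass its skip mask straight to the completion helper to obtain the first step that really has to run.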

Signed-off-by: Tejun Heo <tj [at] kernel.org>
Cc: Christoph Hellwig <hch [at] infradead.org>
---
block/blk-core.c | 2 +-
block/blk-flush.c | 85 ++++++++++++++++++++++++++----------------------
block/blk.h | 3 ++
include/linux/blkdev.h | 38 +--------------------
4 files changed, 52 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index efe391b..c00ace2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)

spin_lock_irq(q->queue_lock);

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
where = ELEVATOR_INSERT_FRONT;
goto get_rq;
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
/*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
*/
#include <linux/kernel.h>
#include <linux/module.h>
@@ -9,6 +9,15 @@

#include "blk.h"

+/* FLUSH/FUA sequences */
+enum {
+ QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
+ QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_FSEQ_DONE = (1 << 4),
+};
+
static struct request *queue_next_fseq(struct request_queue *q);

unsigned blk_flush_cur_seq(struct request_queue *q)
@@ -79,6 +88,7 @@ static void queue_flush(struct request_queue *q, struct request *rq,

static struct request *queue_next_fseq(struct request_queue *q)
{
+ struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;

switch (blk_flush_cur_seq(q)) {
@@ -87,12 +97,11 @@ static struct request *queue_next_fseq(struct request_queue *q)
break;

case QUEUE_FSEQ_DATA:
- /* initialize proxy request and queue it */
+ /* initialize proxy request, inherit FLUSH/FUA and queue it */
blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_flush_rq->bio);
- rq->cmd_flags &= ~REQ_HARDBARRIER;
- if (q->ordered & QUEUE_ORDERED_DO_FUA)
- rq->cmd_flags |= REQ_FUA;
+ init_request_from_bio(rq, orig_rq->bio);
+ rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
+ rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -110,60 +119,58 @@ static struct request *queue_next_fseq(struct request_queue *q)

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
+ unsigned int fflags = q->flush_flags; /* may change, cache it */
+ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
+ bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
+ bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
unsigned skip = 0;

- if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ /*
+ * Special case. If there's data but flush is not necessary,
+ * the request can be issued directly.
+ *
+ * Flush w/o data should be able to be issued directly too but
+ * currently some drivers assume that rq->bio contains
+ * non-zero data if it isn't NULL and empty FLUSH requests
+ * getting here usually have bio's without data.
+ */
+ if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
return rq;
+ }

+ /*
+ * Sequenced flushes can't be processed in parallel. If
+ * another one is already in progress, queue for later
+ * processing.
+ */
if (q->flush_seq) {
- /*
- * Sequenced flush is already in progress and they
- * can't be processed in parallel. Queue for later
- * processing.
- */
list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}

- if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
-
/*
* Start a new flush sequence
*/
q->flush_err = 0;
- q->ordered = q->next_ordered;
q->flush_seq |= QUEUE_FSEQ_STARTED;

- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
+ /* adjust FLUSH/FUA of the original request and stash it away */
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
blk_dequeue_request(rq);
q->orig_flush_rq = rq;

- if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ /* skip unneeded sequences and return the first one */
+ if (!do_preflush)
skip |= QUEUE_FSEQ_PREFLUSH;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ if (!blk_rq_sectors(rq))
skip |= QUEUE_FSEQ_DATA;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ if (!do_postflush)
skip |= QUEUE_FSEQ_POSTFLUSH;
-
- /* complete skipped sequences and return the first sequence */
return blk_flush_complete_seq(q, skip, 0);
}

diff --git a/block/blk.h b/block/blk.h
index 24b92bd..a09c18b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -60,6 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
+ if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
+ rq == &q->flush_rq)
+ return rq;
rq = blk_do_flush(q, rq);
if (rq)
return rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87e58f0..5ce0696 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,7 +357,6 @@ struct request_queue
/*
* for flush operations
*/
- unsigned int ordered, next_ordered;
unsigned int flush_flags;
unsigned int flush_seq;
int flush_err;
@@ -464,40 +463,6 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
__clear_bit(flag, &q->queue_flags);
}

-enum {
- /*
- * Hardbarrier is supported with one of the following methods.
- *
- * NONE : hardbarrier unsupported
- * DRAIN : ordering by draining is enough
- * DRAIN_FLUSH : ordering by draining w/ pre and post flushes
- * DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- */
- QUEUE_ORDERED_DO_PREFLUSH = 0x10,
- QUEUE_ORDERED_DO_BAR = 0x20,
- QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
- QUEUE_ORDERED_DO_FUA = 0x80,
-
- QUEUE_ORDERED_NONE = 0x00,
-
- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
- /*
- * FLUSH/FUA sequences.
- */
- QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
- QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_FSEQ_DONE = (1 << 4),
-};
-
#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
@@ -576,7 +541,8 @@ static inline void blk_clear_queue_full(struct request_queue *q, int sync)
* it already be started by driver.
*/
#define RQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+ (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \
+ REQ_FLUSH | REQ_FUA)
#define rq_mergeable(rq) \
(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
(((rq)->cmd_flags & REQ_DISCARD) || \
--
1.7.1

Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045918 ]

[PATCH 01/11] block/loop: queue ordered mode should be DRAIN_FLUSH

loop implements FLUSH using fsync but was incorrectly setting its
ordered mode to DRAIN. Change it to DRAIN_FLUSH. In practice, this
doesn't change anything as loop doesn't make use of the block layer
ordered implementation.
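
As a rough userspace illustration of why a file-backed device belongs in the "needs flush" category: data written to the backing file can sit in a volatile cache until fsync() is called, so a flush maps naturally onto fsync(). The helper below is a hypothetical sketch, not code from loop.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a buffer to the backing file and then flush it to stable storage.
 * Returns 0 on success or a negative errno value, kernel style.
 */
int sk_backing_write_flush(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		p += n;
		len -= (size_t)n;
	}

	/* the write alone only reaches the cache; fsync() is the flush */
	if (fsync(fd) != 0)
		return -errno;
	return 0;
}
```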

Signed-off-by: Tejun Heo <tj [at] kernel.org>
---
drivers/block/loop.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f3c636d..c3a4a2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
--
1.7.1
Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045919 ]

[PATCH 05/11] block: misc cleanups in barrier code

Make the following cleanups in preparation for the barrier/flush update.

* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
block/blk.h.

* blk_do_ordered() now returns pointer to struct request, with %NULL
meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
later". The third case will be dropped with further changes.

* In the initialization of proxy barrier request, data direction is
already set by init_request_from_bio(). Drop unnecessary explicit
REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
flag setting.

* add_request() is collapsed into __make_request().

These changes don't make any functional difference.
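
The three-way return convention from the second bullet can be tried out in userspace. The kernel's ERR_PTR()/IS_ERR() helpers are re-created here in simplified form, and the request structure and function names are hypothetical, so this is a sketch of the idiom rather than the patched code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, as ERR_PTR() does. */
static inline void *sk_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top SK_MAX_ERRNO addresses. */
static inline int sk_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SK_MAX_ERRNO;
}

static inline long sk_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct sk_request {
	int seq;		/* position in the ordered sequence */
	int terminated;		/* hypothetical: request was ended early */
};

/*
 * Returns rq if it may be dispatched now, NULL if it is gone and the
 * caller should look at the next request, or ERR_PTR(-EAGAIN) if the
 * caller should stop and try again later.
 */
struct sk_request *sk_do_ordered(struct sk_request *rq, int cur_seq)
{
	assert(rq != NULL);

	if (rq->terminated)
		return NULL;
	if (rq->seq > cur_seq)
		return sk_err_ptr(-EAGAIN);
	return rq;
}

enum sk_action { SK_DISPATCH, SK_TRY_NEXT, SK_TRY_LATER };

/* How a caller such as __elv_next_request() might tell the cases apart. */
enum sk_action sk_classify(const struct sk_request *ret)
{
	if (!ret)
		return SK_TRY_NEXT;
	if (sk_is_err(ret))
		return SK_TRY_LATER;
	return SK_DISPATCH;
}
```

A single pointer return replaces the earlier bool plus output-parameter pair, which is what lets the caller in block/blk.h collapse to one assignment and one test.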

Signed-off-by: Tejun Heo <tj [at] kernel.org>
---
block/blk-barrier.c | 32 ++++++++++++++------------------
block/blk-core.c | 21 ++++-----------------
block/blk.h | 7 +++++--
include/linux/blkdev.h | 1 -
4 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+ struct request *rq)
{
- struct request *rq = *rqp;
unsigned skip = 0;

q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)

/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
- rq->cmd_flags |= REQ_WRITE;
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- init_request_from_bio(rq, q->orig_bar_rq->bio);
rq->end_io = bar_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -171,27 +169,26 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
else
skip |= QUEUE_ORDSEQ_DRAIN;

- *rqp = rq;
-
/*
* Complete skipped sequences. If whole sequence is complete,
- * return false to tell elevator that this request is gone.
+ * return %NULL to tell elevator that this request is gone.
*/
- return !blk_ordered_complete_seq(q, skip, 0);
+ if (blk_ordered_complete_seq(q, skip, 0))
+ rq = NULL;
+ return rq;
}

-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
{
- struct request *rq = *rqp;
const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
(rq->cmd_flags & REQ_HARDBARRIER);

if (!q->ordseq) {
if (!is_barrier)
- return true;
+ return rq;

if (q->next_ordered != QUEUE_ORDERED_NONE)
- return start_ordered(q, rqp);
+ return start_ordered(q, rq);
else {
/*
* Queue ordering not supported. Terminate
@@ -199,8 +196,7 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
*/
blk_dequeue_request(rq);
__blk_end_request_all(rq, -EOPNOTSUPP);
- *rqp = NULL;
- return false;
+ return NULL;
}
}

@@ -211,14 +207,14 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
/* Special requests are not subject to ordering rules. */
if (rq->cmd_type != REQ_TYPE_FS &&
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
- return true;
+ return rq;

/* Ordered by draining. Wait for turn. */
WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
+ rq = ERR_PTR(-EAGAIN);

- return true;
+ return rq;
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f802dd..ed8ef89 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1037,22 +1037,6 @@ void blk_insert_request(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(blk_insert_request);

-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
- drive_stat_acct(req, 1);
-
- /*
- * elevator indicated where it wants this request to be
- * inserted at elevator_merge time
- */
- __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
static void part_round_stats_single(int cpu, struct hd_struct *part,
unsigned long now)
{
@@ -1316,7 +1300,10 @@ get_rq:
req->cpu = blk_cpu_to_group(smp_processor_id());
if (queue_should_plug(q) && elv_queue_empty(q))
blk_plug_device(q);
- add_request(q, req);
+
+ /* insert the request into the elevator */
+ drive_stat_acct(req, 1);
+ __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 6e7dc87..874eb4e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete(struct request *rq)
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))

+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
@@ -58,8 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (blk_do_ordered(q, &rq))
- return rq;
+ rq = blk_do_ordered(q, rq);
+ if (rq)
+ return !IS_ERR(rq) ? rq : NULL;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6003f7c..21baa19 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,7 +867,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
--
1.7.1
Tejun Heo [ Thu, 12 August 2010 14:41 ] [ ID #2045920 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

The patchset looks functionally correct to me, and with a small patch
to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the
underlying qemu shows that we actually get the flush requests where we should.

No performance or power fail testing done yet.

But I do not like the transition very much. The new WRITE_FUA_FLUSH
request is exactly what filesystems expect from a current barrier
request, so I'd rather move to that functionality without breaking stuff
in between.

So if it were up to me, I'd keep patches 1, 2, 4 and 5 from your series, then
a main one to relax barrier semantics, then have the renaming patches 7
and 8, and possibly keep patch 11 separate from the main implementation
change, and if absolutely necessary also a separate one to introduce REQ_FUA
and REQ_FLUSH in the bio interface, but keep things working while doing
this.

Then we can have patches to disable the reiserfs barrier "optimization" as
the very first one, and DM/MD support, which I'm currently working on,
as the last one, and we can start doing the heavy testing.

Christoph Hellwig [ Fri, 13 August 2010 13:48 ] [ ID #2045965 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Each filesystem needs to be updated to enforce request
> ordering themselves and then to use REQ_FLUSH/FUA mechanism.

I generally agree with the patchset, but I believe this particular move
is a really bad move.

I won't even mention the obvious: common functionality (enforcing
request ordering in this case) should be handled by a common library,
not internally by the zillion file systems Linux has.

The worst part of this move is that it would hide all the request ordering
semantics inside file systems in, most likely, a very unclear way.
As a result, if I or someone else decided to implement
"hardware offload" of request ordering (ORDERED requests), we
would not be able to see any improvement until at least one file system
had been changed to use it. Worse, if the implementor can't
demonstrate the improvement, how can he encourage file system
developers to update their file systems? Basically, this would mean
that only a person with *BOTH* deep storage and file system internals
knowledge could do the job. How many such people do you know? Both storage
and file systems are very wide and tricky topics, so people nearly always
specialize in one of them, not both.

Thus, this move would basically mean that the proper ordered queuing
would probably never be implemented in Linux.

I believe it would be much better to create a common interface which file
systems would use to enforce request order when they need it.

Advantages of this approach:

1. The ordering requirements of file systems would be clear.

2. They would be handled in one place by a common code.

3. Any storage level expert can try to implement ordered queuing without
a deep dive into file systems design and implementation.

I already suggested such an interface in
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Internally for the
moment it can be implemented using existing REQ_FLUSH/FUA/etc. and
waiting for all the requests in the group to finish. As a nice side
effect, if a device doesn't support FUA, it would be possible to issue
SYNC_CACHE command(s) only for required blocks, not for the whole device
as it is done now.

If requested, I can develop the interface further.

Vlad
Vladislav Bolkhovitin [ Fri, 13 August 2010 14:55 ] [ ID #2045966 ]

Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

Hello Tejun,

Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> draining for barrier requests will be removed soon which will render
> the advantage of tag ordering moot.

Have you seen Hannes Reinecke's and my measurements in
http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and
http://marc.info/?l=linux-scsi&m=128111995217405&w=2 respectively?

If yes, what other evidence do you need to see that tag ordering is
a big performance win?

Vlad
Vladislav Bolkhovitin [ Fri, 13 August 2010 14:56 ] [ ID #2045967 ]

Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

On Fri, Aug 13, 2010 at 04:56:32PM +0400, Vladislav Bolkhovitin wrote:
> Tejun Heo, on 08/12/2010 04:41 PM wrote:
> >Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> >draining for barrier requests will be removed soon which will render
> >the advantage of tag ordering moot.
>
> Have you seen Hannes Reinecke's and my measurements in
> http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and
> http://marc.info/?l=linux-scsi&m=128111995217405&w=2 respectively?
>
> If yes, what other evidence do you need to see that tag ordering is
> a big performance win?

It's not tag ordering that is a win but big queue depth. That's what you
measured and what I fully agree with. I haven't been able to get out of
Hannes what he actually measured.

And if you actually look at the patchset, allowing deep queues is
exactly what it does for us. While I haven't tested this
patchset, only my previous version, it does get us back to using
the full potential of large arrays exactly because of that.
Christoph Hellwig [ Fri, 13 August 2010 15:06 ] [ ID #2045968 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 04:55:33PM +0400, Vladislav Bolkhovitin wrote:
> I won't even mention the obvious: common functionality (enforcing
> request ordering in this case) should be handled by a common library,
> not internally by the zillion file systems Linux has.

I/O ordering is still handled mostly by common code, that is the
pagecache and the buffercache, although a few filesystems like XFS and
btrfs have their own implementation of the second one.

The current ordered semantics of barriers have only been successfully
implemented by a complete queue drain, and have not been used effectively
by filesystems. This patchset removes the bogus global ordering
enforced by the block layer whenever a filesystem wants to be able
to use cache flushes, and because of that allows deeper queues of
outstanding I/O with less latency.

Now I know you in particular are a fan of scsi ordered tags. And as I
told you before, I'm open to reviewing such an implementation if it shows
us any advantages. Adding it after this patch is in fact not any more
complicated than before. I'd almost be tempted to say it's easier, as you
don't have to plug it into the complex state machine we used for barriers,
and more importantly we drop the requirement for the barrier sequence to
be atomic, which in fact made implementing barriers using tagged queues
impossible with the current scsi layer.

As far as playing with ordered tags goes, it's just a matter of adding a
new flag for it on the bio that gets passed down to the driver. For a final
version you'd need a queue-level feature flag saying whether it's supported,
but you don't even need that for the initial work. Then you can implement a
variant of blk_do_flush that, instead of queueing each additional request
once the previous one finishes, queues all two or three at the same time with
your new ordered flag set, at which point you are back to the level of
ordered tag usage that the old code allows. You're still left with
all the hard problems of actually implementing error handling for it
and using it higher up in the filesystem and generic page cache code.

I'd really love to see your results, to the point that I might just try
it myself once I get a little spare time. But my theory is that it won't
help us: the problem with ordered tags is that they enforce global
ordering while we currently have local ordering. While it will reduce
the latency for the process waiting for an fsync or similar, it will
affect other I/O going on in the background and reduce the device's
ability to reorder that I/O.

So for now this patch set is a massive performance improvement for
workloads we care about, while removing the interface we put in place
to allow a theoretical optimization that hasn't shown up in 8 years,
an interface which was in fact just complicated enough to make
that optimization hard.
Christoph Hellwig [ Fri, 13 August 2010 15:17 ] [ ID #2045969 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/13/2010 02:55 PM, Vladislav Bolkhovitin wrote:
> If requested, I can develop the interface further.

I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there? Under the current
framework, there's no easy way to measure full ordered-by-tag
implementation. The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there and there is no way the current usage of ordering-by-tag
only for barrier sequence can achieve anything close to that level of
difference.

Ripping out the original ordering by tag mechanism doesn't amount to
much. The use of ordering-by-tag was pretty half-assed there anyway.
If you think exporting full ordering information from filesystem to
the lower layers is worthwhile, please go ahead. It would be very
interesting to see how much actual difference it can make compared to
ordering-by-filesystem and if it's actually better and the added
complexity is manageable, there's no reason not to do that.

Thank you.

--
tejun
Tejun Heo [ Fri, 13 August 2010 15:21 ] [ ID #2045970 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello, Christoph.

On 08/13/2010 01:48 PM, Christoph Hellwig wrote:
> The patchset looks functionally correct to me, and with a small patch
> to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the
> underlying qemu shows that we actually get the flush requests where we should.

Great.

> No performance or power fail testing done yet.
>
> But I do not like the transition very much. The new WRITE_FUA_FLUSH
> request is exactly what filesystems expect from a current barrier
> request, so I'd rather move to that functionality without breaking stuff
> in between.
>
> So if it were up to me, I'd keep patches 1, 2, 4 and 5 from your series, then
> a main one to relax barrier semantics, then have the renaming patches 7
> and 8, and possibly keep patch 11 separate from the main implementation
> change, and if absolutely necessary also a separate one to introduce REQ_FUA
> and REQ_FLUSH in the bio interface, but keep things working while doing
> this.

There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
and just deprecate it. One is to avoid breaking filesystems'
expectations underneath it. Please note that there are out-of-tree
filesystems too. I think it would be too dangerous to relax
REQ_HARDBARRIER.

Another is that pseudo block layer drivers (loop, virtio_blk,
md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
would be broken in obscure ways between REQ_HARDBARRIER semantics
change and updates to each of those drivers, so I don't really think
changing the semantics while the mechanism is online is a good idea.

> Then we can have patches to disable the reiserfs barrier "optimization" as
> the very first one, and DM/MD support, which I'm currently working on,
> as the last one, and we can start doing the heavy testing.

Oops, I've already converted loop, virtio_blk/lguest and am working on
md/dm right now too. I'm almost done with md and now doing dm. :-)
Maybe we should post them right now so that we don't waste too much
time trying to solve the same problems?

Thanks.

--
tejun
Tejun Heo [ Fri, 13 August 2010 15:48 ] [ ID #2045971 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
> and just deprecate it. One is to avoid breaking filesystems'
> expectations underneath it. Please note that there are out-of-tree
> filesystems too. I think it would be too dangerous to relax
> REQ_HARDBARRIER.

Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile. And while out-of-tree filesystems do exist, it's their
problem to keep up with kernel changes. They chose not to be part
of the Linux kernel, so it'll be their job to keep up with it.

> Another is that pseudo block layer drivers (loop, virtio_blk,
> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
> would be broken in obscure ways between REQ_HARDBARRIER semantics
> change and updates to each of those drivers, so I don't really think
> changing the semantics while the mechanism is online is a good idea.

I don't think doing those changes in a separate commit is a good idea.

> > Then we can have patches to disable the reiserfs barrier "optimization" as
> > the very first one, and DM/MD support, which I'm currently working on,
> > as the last one, and we can start doing the heavy testing.
>
> Oops, I've already converted loop, virtio_blk/lguest and am working on
> md/dm right now too. I'm almost done with md and now doing dm. :-)
> Maybe we should post them right now so that we don't waste too much
> time trying to solve the same problems?

Here's the dm patch. It only handles normal bio-based dm so far, which
I understand and can test. Request-based dm (multipath) still needs
work.


Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
struct dm_crypt_io *io;
struct crypt_config *cc;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c 2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c 2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -412,8 +412,8 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
+ if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+ rw &= ~REQ_FLUSH;
goto retry;
}

Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c 2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!bio_empty_flush(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: linux-2.6/drivers/md/dm-region-hash.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-region-hash.c 2010-08-13 16:11:04.228004631 +0200
+++ linux-2.6/drivers/md/dm-region-hash.c 2010-08-13 16:11:10.060003932 +0200
@@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;

- if (bio_empty_barrier(bio)) {
+ if (bio_empty_flush(bio)) {
rh->barrier_failure = 1;
return;
}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_
struct bio *bio;

for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio_empty_barrier(bio))
+ if (bio_empty_flush(bio))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
Index: linux-2.6/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-snap.c 2010-08-13 16:11:04.238004701 +0200
+++ linux-2.6/drivers/md/dm-snap.c 2010-08-13 16:11:10.067005677 +0200
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target
chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED;
}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
if (!map_context->flush_request)
bio->bi_bdev = s->origin->bdev;
else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *
struct dm_dev *dev = ti->private;
bio->bi_bdev = dev->bdev;

- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
return DM_MAPIO_REMAPPED;

/* Only tell snapshots if this is a write */
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c 2010-08-13 16:11:04.247011266 +0200
+++ linux-2.6/drivers/md/dm-stripe.c 2010-08-13 16:11:10.072026629 +0200
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *
sector_t offset, chunk;
uint32_t stripe;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
BUG_ON(map_context->flush_request >= sc->stripes);
bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c 2010-08-13 16:11:04.256004631 +0200
+++ linux-2.6/drivers/md/dm.c 2010-08-13 16:11:37.152005462 +0200
@@ -139,17 +139,6 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
- */
- int barrier_error;
-
- /*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
-
- /*
* Processing queue (flush/barriers)
*/
struct workqueue_struct *wq;
@@ -194,9 +183,6 @@ struct mapped_device {

/* sysfs handle */
struct kobject kobj;
-
- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
};

/*
@@ -505,10 +491,6 @@ static void end_io_acct(struct dm_io *io
part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration);
part_stat_unlock();

- /*
- * After this is decremented the bio must not be touched if it is
- * a barrier.
- */
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
pending += atomic_read(&md->pending[rw^0x1]);
@@ -621,7 +603,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,25 +615,13 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
- /*
- * There can be just one barrier request so we use
- * a per-device variable for error reporting.
- * Note that you can't touch the bio after end_io_acct
- */
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
- end_io_acct(io);
- free_io(md, io);
- } else {
- end_io_acct(io);
- free_io(md, io);
+ end_io_acct(io);
+ free_io(md, io);

- if (io_error != DM_ENDIO_REQUEUE) {
- trace_block_bio_complete(md->queue, bio);
+ if (io_error != DM_ENDIO_REQUEUE) {
+ trace_block_bio_complete(md->queue, bio);

- bio_endio(bio, io_error);
- }
+ bio_endio(bio, io_error);
}
}
}
@@ -744,23 +714,6 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&md->barrier_error_lock, flags);
- /*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@@ -798,13 +751,11 @@ static void free_rq_clone(struct request
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
- int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -818,15 +769,8 @@ static void dm_end_request(struct reques
}

free_rq_clone(clone);
-
- if (unlikely(is_barrier)) {
- if (unlikely(error))
- store_barrier_error(md, error);
- run_queue = 0;
- } else
- blk_end_request_all(rq, error);
-
- rq_completed(md, rw, run_queue);
+ blk_end_request_all(rq, error);
+ rq_completed(md, rw, 1);
}

static void dm_unprep_request(struct request *rq)
@@ -1113,7 +1057,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1084,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1129,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_empty_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,8 +1151,8 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
+ if (bio_empty_flush(bio))
+ return __clone_and_map_empty_flush(ci);

ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
@@ -1308,11 +1251,7 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
- bio_io_error(bio);
- else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ bio_io_error(bio);
return;
}

@@ -1326,7 +1265,7 @@ static void __split_and_process_bio(stru
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
ci.sector_count = 1;
ci.idx = bio->bi_idx;

@@ -1420,8 +1359,7 @@ static int _dm_request(struct request_qu
* If we're suspended or the thread is processing barriers
* we have to queue this io for later.
*/
- if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1873,7 +1811,6 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -2233,38 +2170,6 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
-{
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;
-
- dm_flush(md);
-
- if (!bio_empty_barrier(bio)) {
- __split_and_process_bio(md, bio);
- dm_flush(md);
- }
-
- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
-}
-
/*
* Process the deferred bios
*/
@@ -2290,12 +2195,8 @@ static void dm_wq_work(struct work_struc

if (dm_request_based(md))
generic_make_request(c);
- else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
- else
- __split_and_process_bio(md, c);
- }
+ else
+ __split_and_process_bio(md, c);

down_write(&md->io_lock);
}
@@ -2326,8 +2227,6 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
-
for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
for (j = 0; j < ti->num_flush_requests; j++) {
@@ -2341,7 +2240,7 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return 0;
}

static void dm_rq_barrier_work(struct work_struct *work)
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h 2010-08-13 16:11:04.268004351 +0200
+++ linux-2.6/include/linux/bio.h 2010-08-13 16:11:10.082005677 +0200
@@ -66,8 +66,8 @@
#define bio_offset(bio) bio_iovec((bio))->bv_offset
#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio) ((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio) \
- ((bio->bi_rw & REQ_HARDBARRIER) && \
+#define bio_empty_flush(bio) \
+ ((bio->bi_rw & REQ_FLUSH) && \
!bio_has_data(bio) && \
!(bio->bi_rw & REQ_DISCARD))
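
For anyone following along outside the kernel tree, here is a rough
userspace sketch of the test that bio_empty_flush() performs. The sk_
names, the flag values, and the use of a size field as a stand-in for
bio_has_data() are all assumptions made for illustration; they are not
the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; the kernel's values differ. */
#define SK_REQ_FLUSH   (1UL << 0)
#define SK_REQ_FUA     (1UL << 1)
#define SK_REQ_DISCARD (1UL << 2)

/* Stripped-down stand-in for struct bio: only what the test reads. */
struct sk_bio {
	unsigned long bi_rw;   /* request flags */
	size_t        bi_size; /* payload bytes; 0 means no data (assumption) */
};

/* Stand-in for bio_has_data(); the real helper checks other fields. */
static bool sk_bio_has_data(const struct sk_bio *bio)
{
	return bio->bi_size != 0;
}

/*
 * Same three conditions as the bio_empty_flush() macro above:
 * the flush flag is set, there is no payload, and it is not a discard.
 */
static bool sk_bio_empty_flush(const struct sk_bio *bio)
{
	return (bio->bi_rw & SK_REQ_FLUSH) &&
	       !sk_bio_has_data(bio) &&
	       !(bio->bi_rw & SK_REQ_DISCARD);
}
```

Writing it as a function rather than a macro is only a choice made for
this sketch: it type-checks the argument and evaluates it once. A flush
that carries data is not "empty" by this test, which is why the targets
above only take the remap-and-return shortcut for the empty case.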

Christoph Hellwig [ Fr, 13 August 2010 16:38 ] [ ID #2045972 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/13/2010 04:38 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
>> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
>> and just deprecate it. One is to avoid breaking filesystems'
>> expectations underneath it. Please note that there are out-of-tree
>> filesystems too. I think it would be too dangerous to relax
>> REQ_HARDBARRIER.
>
> Note that the renaming patch would include a move from REQ_HARDBARRIER
> to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
> compile. And while out-of-tree filesystems do exist, it's their
> problem to keep up with kernel changes. They decided not to be part
> of the Linux kernel, so it'll be their job to keep up with it.

Oh, right, we can simply remove REQ_HARDBARRIER completely.

>> Another is that pseudo block layer drivers (loop, virtio_blk,
>> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
>> would be broken in obscure ways between REQ_HARDBARRIER semantics
>> change and updates to each of those drivers, so I don't really think
>> changing the semantics while the mechanism is online is a good idea.
>
> I don't think doing those changes in a separate commit is a good idea.

Do you want to change the whole thing in a single commit? That would
be a pretty big invasive patch touching multiple subsystems. Also, I
don't know what to do about drbd and would like to leave its
conversion to the maintainer (in separate patches).

Eh, well, this is mostly logistics. Jens, what do you think?

>>> Then we can do patches to disable the reiserfs barrier "optimization" as
>>> the very first one, and DM/MD support which I'm currently working on
>>> as the last one and we can start doing the heavy testing.
>>
>> Oops, I've already converted loop, virtio_blk/lguest and am working on
>> md/dm right now too. I'm almost done with md and now doing dm. :-)
>> Maybe we should post them right now so that we don't waste too much
>> time trying to solve the same problems?
>
> Here's the dm patch. So far it only handles normal bio-based dm, which
> I understand and can test. Request-based dm (multipath) still needs
> work.

Here's the combined patch I've been working on. I've verified loop
and virtio_blk/lguest. I just (like five mins ago) got md/dm conversion
compiling, so I'm sure they're broken. The neat part is that thanks
to the separation between REQ_FLUSH and FUA handling, bio mangling
drivers only have to sequence the pre-flush and pass FUA directly to
lower layers, which in many cases saves an array-wide cache flush
cycle.
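
To make the pre-flush/FUA split concrete, here is a small userspace
sketch of what a stacking driver would do under this scheme. It is an
illustration only: the sk_ names, the flag values and the string log
standing in for a lower device are all made up, and nothing here is
kernel API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits for illustration; the kernel's values differ. */
#define SK_REQ_FLUSH (1u << 0)
#define SK_REQ_FUA   (1u << 1)

/* Pretend lower device: records the operations in the order it saw them. */
struct sk_lower {
	char log[64];
};

static void sk_lower_op(struct sk_lower *dev, const char *op)
{
	size_t used = strlen(dev->log);

	snprintf(dev->log + used, sizeof(dev->log) - used, "%s%s",
		 used ? "," : "", op);
}

/*
 * Stacking-driver side: sequence the pre-flush itself, strip the flush
 * flag, and hand the data part down with the FUA flag left untouched.
 */
static void sk_submit(struct sk_lower *dev, unsigned int flags, size_t len)
{
	if (flags & SK_REQ_FLUSH) {
		sk_lower_op(dev, "flush");   /* pre-flush goes first */
		flags &= ~SK_REQ_FLUSH;      /* data part no longer carries it */
	}

	if (len == 0)
		return;                      /* empty flush: nothing else to do */

	/* No post-flush here: FUA is passed straight to the lower layer. */
	sk_lower_op(dev, (flags & SK_REQ_FUA) ? "write+fua" : "write");
}
```

With a zeroed sk_lower, submitting SK_REQ_FLUSH | SK_REQ_FUA with a
non-zero length leaves "flush,write+fua" in the log: one flush before
the write and none after it, which is where the saved flush cycle
comes from when the lower layer can do FUA on its own.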

After getting this patch working, the only remaining bits would be
blktrace and drbd.

Thanks.

Documentation/lguest/lguest.c | 36 +++-----
drivers/block/loop.c | 18 ++--
drivers/block/virtio_blk.c | 26 ++---
drivers/md/dm-io.c | 20 ----
drivers/md/dm-log.c | 2
drivers/md/dm-raid1.c | 8 -
drivers/md/dm-snap-persistent.c | 2
drivers/md/dm.c | 176 +++++++++++++++++++--------------------
drivers/md/linear.c | 4
drivers/md/md.c | 117 +++++---------------------
drivers/md/md.h | 23 +----
drivers/md/multipath.c | 4
drivers/md/raid0.c | 4
drivers/md/raid1.c | 178 +++++++++++++---------------------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 6 -
drivers/md/raid5.c | 18 +---
include/linux/virtio_blk.h | 6 +
18 files changed, 244 insertions(+), 406 deletions(-)

Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
struct file *file = lo->lo_backing_file;

- if (barrier) {
- if (unlikely(!file->f_op->fsync)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* REQ_HARDBARRIER is deprecated */
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }

+ if (bio->bi_rw & REQ_FLUSH) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret)) {
+ if (unlikely(ret && ret != -EINVAL)) {
ret = -EIO;
goto out;
}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop

ret = lo_send(lo, bio, pos);

- if (barrier && !ret) {
+ if ((bio->bi_rw & REQ_FUA) && !ret) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret))
+ if (unlikely(ret && ret != -EINVAL))
ret = -EIO;
}
} else
Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -157,6 +154,8 @@ static bool do_req(struct request_queue
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out += num;
+ if (req->cmd_flags & REQ_FUA)
+ vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
in += num;
@@ -307,6 +306,7 @@ static int __devinit virtblk_probe(struc
{
struct virtio_blk *vblk;
struct request_queue *q;
+ unsigned int flush;
int err;
u64 cap;
u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@ static int __devinit virtblk_probe(struc
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
+ flush = 0;
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
- blk_queue_flush(q, REQ_FLUSH);
+ flush |= REQ_FLUSH;
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+ flush |= REQ_FUA;
+ blk_queue_flush(q, flush);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@ static const struct virtio_device_id id_
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
};

/*
Index: block/include/linux/virtio_blk.h
===================================================================
--- block.orig/include/linux/virtio_blk.h
+++ block/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@
#define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */
#define VIRTIO_BLK_F_FLUSH 9 /* Cache flush command support */
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
+#define VIRTIO_BLK_F_FUA 11 /* Forced Unit Access write support */

#define VIRTIO_BLK_ID_BYTES 20 /* ID string length */

@@ -70,7 +71,10 @@ struct virtio_blk_config {
#define VIRTIO_BLK_T_FLUSH 4

/* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID 8
+#define VIRTIO_BLK_T_GET_ID 8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA 16

/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ /* Honor FUA by syncing everything. */
+ if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+ ret = fdatasync(vblk->fd);
+ verbose("FUA fdatasync: %i\n", ret);
+ }
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1730,9 @@ static void setup_block_file(const char
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH and FUA. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);
+ add_feature(dev, VIRTIO_BLK_F_FUA);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);
Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct reques
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -280,40 +280,29 @@ static void mddev_resume(mddev_t *mddev)

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -330,60 +319,56 @@ static void submit_barriers(mddev_t *mdd
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

static inline mddev_t *mddev_get(mddev_t *mddev)
{
@@ -642,31 +627,6 @@ static void super_written(struct bio *bi
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -675,51 +635,28 @@ void md_super_write(mddev_t *mddev, mdk_
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@ static int super_90_validate(mddev_t *md
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1431,7 +1367,6 @@ static int super_1_validate(mddev_t *mdd
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4463,7 +4398,6 @@ static int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -67,7 +67,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -249,13 +248,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -308,16 +300,13 @@ struct mddev_s
struct list_head all_mddevs;

struct attribute_group *to_remove;
- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
};


@@ -458,7 +447,7 @@ extern void md_done_sync(mddev_t *mddev,
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *m
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(stru
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);

- update_head_pos(mirror, r1_bio);
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);

- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
- }
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -787,17 +778,14 @@ static int make_request(mddev_t *mddev,
struct bio_list bl;
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
- bool do_barriers;
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
mdk_rdev_t *blocked_rdev;

/*
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +809,6 @@ static int make_request(mddev_t *mddev,
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -877,7 +858,7 @@ static int make_request(mddev_t *mddev,
read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
read_bio->bi_bdev = mirror->rdev->bdev;
read_bio->bi_end_io = raid1_end_read_request;
- read_bio->bi_rw = READ | do_sync;
+ read_bio->bi_rw = READ | do_sync | do_flush_fua;
read_bio->bi_private = r1_bio;

generic_make_request(read_bio);
@@ -959,10 +940,6 @@ static int make_request(mddev_t *mddev,
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +952,7 @@ static int make_request(mddev_t *mddev,
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1631,41 +1608,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -3278,7 +3278,7 @@ static void handle_stripe5(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3580,7 @@ static void handle_stripe6(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3958,8 @@ static int make_request(mddev_t *mddev,
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4083,7 +4077,7 @@ static int make_request(mddev_t *mddev,
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if (mddev->flush_bio &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4106,7 +4100,7 @@ static int make_request(mddev_t *mddev,
bio_endio(bi, 0);
}

- if (mddev->barrier) {
+ if (mddev->flush_bio) {
/* We need to wait for the stripes to all be handled.
* So: wait for preread_active_stripes to drop to 0.
*/
Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -799,13 +799,13 @@ static int make_request(mddev_t *mddev,
int i;
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/dm-io.c
===================================================================
--- block.orig/drivers/md/dm-io.c
+++ block/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
*/
struct io {
unsigned long error_bits;
- unsigned long eopnotsupp_bits;
atomic_t count;
struct task_struct *sleeper;
struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_
*---------------------------------------------------------------*/
static void dec_count(struct io *io, unsigned int region, int error)
{
- if (error) {
+ if (error)
set_bit(region, &io->error_bits);
- if (error == -EOPNOTSUPP)
- set_bit(region, &io->eopnotsupp_bits);
- }

if (atomic_dec_and_test(&io->count)) {
if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned r
sector_t remaining = where->count;

/*
- * where->count may be zero if rw holds a write barrier and we
- * need to send a zero-sized barrier.
+ * where->count may be zero if rw holds a flush and we need to
+ * send a zero-sized flush.
*/
do {
/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *
return -EIO;
}

-retry:
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = current;
io->client = client;
@@ -412,11 +406,6 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
- goto retry;
- }
-
if (error_bits)
*error_bits = io->error_bits;

@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client

io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = NULL;
io->client = client;
Index: block/drivers/md/dm-raid1.c
===================================================================
--- block.orig/drivers/md/dm-raid1.c
+++ block/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE_BARRIER,
+ .bi_rw = WRITE_FLUSH,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *
struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+ .bi_rw = WRITE | (bio->bi_rw & (WRITE_FLUSH | WRITE_FUA)),
.mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH) || bio_has_data(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -139,21 +139,21 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
+ * An error from the flush request currently being processed.
*/
- int barrier_error;
+ int flush_error;

/*
- * Protect barrier_error from concurrent endio processing
+ * Protect flush_error from concurrent endio processing
* in request-based dm.
*/
- spinlock_t barrier_error_lock;
+ spinlock_t flush_error_lock;

/*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
+ struct work_struct flush_work;

/* A pointer to the currently processing pre/post flush request */
struct request *flush_request;
@@ -195,8 +195,8 @@ struct mapped_device {
/* sysfs handle */
struct kobject kobj;

- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
+ /* zero-length flush that will be cloned and submitted to targets */
+ struct bio flush_bio;
};

/*
@@ -507,7 +507,7 @@ static void end_io_acct(struct dm_io *io

/*
* After this is decremented the bio must not be touched if it is
- * a barrier.
+ * a flush.
*/
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & REQ_FLUSH))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & REQ_FLUSH) {
/*
- * There can be just one barrier request so we use
+ * There can be just one flush request so we use
* a per-device variable for error reporting.
* Note that you can't touch the bio after end_io_acct
*/
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
+ if (!md->flush_error)
+ md->flush_error = io_error;
end_io_acct(io);
free_io(md, io);
} else {
@@ -744,21 +744,18 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
{
unsigned long flags;

- spin_lock_irqsave(&md->barrier_error_lock, flags);
+ spin_lock_irqsave(&md->flush_error_lock, flags);
/*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+ * Basically, the first error is taken, but requeue request
+ * supersedes any I/O error.
+ */
+ if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+ md->flush_error = error;
+ spin_unlock_irqrestore(&md->flush_error_lock, flags);
}

/*
@@ -799,12 +796,12 @@ static void dm_end_request(struct reques
{
int rw = rq_data_dir(clone);
int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+ bool is_flush = clone->cmd_flags & REQ_FLUSH;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -819,12 +816,13 @@ static void dm_end_request(struct reques

free_rq_clone(clone);

- if (unlikely(is_barrier)) {
+ if (!is_flush)
+ blk_end_request_all(rq, error);
+ else {
if (unlikely(error))
- store_barrier_error(md, error);
+ store_flush_error(md, error);
run_queue = 0;
- } else
- blk_end_request_all(rq, error);
+ }

rq_completed(md, rw, run_queue);
}
@@ -851,9 +849,9 @@ void dm_requeue_unmapped_request(struct
struct request_queue *q = rq->q;
unsigned long flags;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -950,14 +948,14 @@ static void dm_complete_request(struct r
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request. So can't use
+ * Flush clones share an original request. So can't use
* softirq_done with the original.
* Pass the clone to dm_done() directly in this special case.
* It is safe (even if clone->q->queue_lock is held here)
* because there is no I/O dispatching during the completion
- * of barrier clone.
+ * of flush clone.
*/
dm_done(clone, error, true);
return;
@@ -979,9 +977,9 @@ void dm_kill_unmapped_request(struct req
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -1098,7 +1096,7 @@ static void dm_bio_destructor(struct bio
}

/*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that is just a part of a bvec.
*/
static struct bio *split_bvec(struct bio *bio, sector_t sector,
unsigned short idx, unsigned int offset,
@@ -1113,7 +1111,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1138,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1183,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,9 +1205,6 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
-
ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
return -EIO;
@@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
+ if (!(bio->bi_rw & REQ_FLUSH))
bio_io_error(bio);
else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ if (!md->flush_error)
+ md->flush_error = -EIO;
return;
}

@@ -1325,14 +1319,22 @@ static void __split_and_process_bio(stru
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
+ ci.sector_count = bio_sectors(bio);
+ else {
+ /* FLUSH bio reaching here should all be empty */
+ WARN_ON_ONCE(bio_has_data(bio));
ci.sector_count = 1;
+ }
ci.idx = bio->bi_idx;

start_io_acct(ci.io);
- while (ci.sector_count && !error)
- error = __clone_and_map(&ci);
+ while (ci.sector_count && !error) {
+ if (!(bio->bi_rw & REQ_FLUSH))
+ error = __clone_and_map(&ci);
+ else
+ error = __clone_and_map_flush(&ci);
+ }

/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
part_stat_unlock();

/*
- * If we're suspended or the thread is processing barriers
+ * If we're suspended or the thread is processing flushes
* we have to queue this io for later.
*/
if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ (bio->bi_rw & REQ_FLUSH)) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1464,10 +1466,7 @@ static int dm_request(struct request_que

static bool dm_rq_is_flush_request(struct request *rq)
{
- if (rq->cmd_flags & REQ_FLUSH)
- return true;
- else
- return false;
+ return rq->cmd_flags & REQ_FLUSH;
}

void dm_dispatch_request(struct request *rq)
@@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
if (dm_rq_is_flush_request(rq)) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+ clone->cmd_flags |= (REQ_FLUSH | WRITE);
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
@@ -1668,7 +1667,7 @@ static void dm_request_fn(struct request
BUG_ON(md->flush_request);
md->flush_request = rq;
blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
+ queue_work(md->wq, &md->flush_work);
goto out;
}

@@ -1843,7 +1842,7 @@ out:
static const struct block_device_operations dm_blk_dops;

static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);

/*
* Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1872,7 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
+ spin_lock_init(&md->flush_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1918,7 +1917,7 @@ static struct mapped_device *alloc_dev(i
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+ INIT_WORK(&md->flush_work, dm_rq_flush_work);
init_waitqueue_head(&md->eventq);

md->disk->major = _major;
@@ -2233,31 +2232,28 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
{
+ md->flush_error = 0;
+
+ /* handle REQ_FLUSH */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+ __split_and_process_bio(md, &md->flush_bio);

dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;

- dm_flush(md);
+ bio->bi_rw &= ~REQ_FLUSH;

- if (!bio_empty_barrier(bio)) {
+ /* handle data + REQ_FUA */
+ if (bio_has_data(bio))
__split_and_process_bio(md, bio);
- dm_flush(md);
- }

- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
+ if (md->flush_error != DM_ENDIO_REQUEUE)
+ bio_endio(bio, md->flush_error);
else {
spin_lock_irq(&md->deferred_lock);
bio_list_add_head(&md->deferred, bio);
@@ -2291,8 +2287,8 @@ static void dm_wq_work(struct work_struc
if (dm_request_based(md))
generic_make_request(c);
else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
+ if (c->bi_rw & REQ_FLUSH)
+ process_flush(md, c);
else
__split_and_process_bio(md, c);
}
@@ -2317,8 +2313,8 @@ static void dm_rq_set_flush_nr(struct re
tio->info.flush_request = flush_nr;
}

-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
{
int i, j;
struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2322,7 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
+ md->flush_error = 0;

for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
@@ -2341,26 +2337,26 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return md->flush_error;
}

-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
{
int error;
struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
+ flush_work);
struct request_queue *q = md->queue;
struct request *rq;
unsigned long flags;

/*
* Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
+ * the md can't be deleted by device opener when the flush request
* completes.
*/
dm_get(md);

- error = dm_rq_barrier(md);
+ error = dm_rq_flush(md);

rq = md->flush_request;
md->flush_request = NULL;
@@ -2520,7 +2516,7 @@ int dm_suspend(struct mapped_device *md,
up_write(&md->io_lock);

/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+ * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
* can be kicked until md->queue is stopped. So stop md->queue before
* flushing md->wq.
*/
Index: block/drivers/md/dm-log.c
===================================================================
--- block.orig/drivers/md/dm-log.c
+++ block/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc
.count = 0,
};

- lc->io_req.bi_rw = WRITE_BARRIER;
+ lc->io_req.bi_rw = WRITE_FLUSH;

return dm_io(&lc->io_req, 1, &null_location, NULL);
}
Index: block/drivers/md/dm-snap-persistent.c
===================================================================
--- block.orig/drivers/md/dm-snap-persistent.c
+++ block/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(
/*
* Commit exceptions to disk.
*/
- if (ps->valid && area_io(ps, WRITE_BARRIER))
+ if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;

/*

--
tejun
Tejun Heo [ Fr, 13 August 2010 16:51 ] [ ID #2045973 ]
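
The md.h comment in the patch above says that the last preflush to finish schedules a worker to submit the rest of the request. That last-one-out counting can be sketched in userspace as follows. This is only a rough model, assuming C11 atomics in place of the kernel's atomic_t; the model_ names and the initial bias of one held by the submitter are illustrative assumptions, not code taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not the kernel's struct mddev_s. */
struct model_flush {
	atomic_int pending;	/* outstanding preflushes plus one submitter bias */
	bool rest_submitted;	/* set once the rest of the request may go out */
};

/* Begin a flush across ndevs member devices. */
static void model_flush_start(struct model_flush *f, int ndevs)
{
	assert(ndevs >= 0);
	/* Bias of 1 so completions cannot reach zero while still issuing. */
	atomic_init(&f->pending, 1);
	f->rest_submitted = false;
	for (int i = 0; i < ndevs; i++)
		atomic_fetch_add(&f->pending, 1);	/* one count per preflush */
}

/* Drop one count; returns true only for the caller that reaches zero. */
static bool model_flush_put(struct model_flush *f)
{
	if (atomic_fetch_sub(&f->pending, 1) == 1) {
		f->rest_submitted = true;	/* stands in for scheduling the worker */
		return true;
	}
	return false;
}
```

In this model the submitter calls model_flush_put() once after issuing every preflush, so the rest of the request goes out exactly once whether the device completions arrive before or after that call.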

Re: [PATCH 03/11] block: deprecate barrier and replace

On Fri, Aug 13, 2010 at 06:07:13PM -0700, Jeremy Fitzhardinge wrote:
> On 08/12/2010 05:41 AM, Tejun Heo wrote:
> > Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
> > requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
> > -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
> > blk_queue_flush().
> >
> > blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
> > device has write cache and can flush it, it should set REQ_FLUSH. If
> > the device can handle FUA writes, it should also set REQ_FUA.
>
> Christoph, do these two patches (parts 2 and 3) make xen-blkfront
> correct WRT barriers/flushing as far as you're concerned?

If all your backends handle a zero-length BLKIF_OP_WRITE_BARRIER request
it is a fully correct, but rather suboptimal implementation. To get
all the benefit of the new non-draining barriers you'll need a new
BLKIF_OP_FLUSH request that only flushes the cache, but has no ordering
side effects. Note that "suboptimal" here means not as good as the new
barrier implementation, but it shouldn't be noticeably worse than the
old one for Xen.
hch [ Sat, 14 August 2010 11:42 ] [ ID #2046024 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
> Do you want to change the whole thing in a single commit? That would
> be a pretty big invasive patch touching multiple subsystems.

We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following and then do the
final renaming patches. It would still be fewer patches than now, but
keep things working through the whole transition, which would really
help bisecting any problems.

> + if (req->cmd_flags & REQ_FUA)
> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;

I'd suggest not adding FUA support to virtio yet. Just using the flush
feature gives you a fully working barrier implementation.

Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
avoid separate pre- and post flushes, but I really want to benchmark if
it makes an impact on real life setups first.

> Index: block/drivers/md/linear.c
> ===================================================================
> --- block.orig/drivers/md/linear.c
> +++ block/drivers/md/linear.c
> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
> dev_info_t *tmp_dev;
> sector_t start_sector;
>
> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> - md_barrier_request(mddev, bio);
> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
> + md_flush_request(mddev, bio);

We only need the special md_flush_request handling for
empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
flag propagated to the underlying devices.

> +static void md_end_flush(struct bio *bio, int err)
> {
> mdk_rdev_t *rdev = bio->bi_private;
> mddev_t *mddev = rdev->mddev;
>
> rdev_dec_pending(rdev, mddev);
>
> if (atomic_dec_and_test(&mddev->flush_pending)) {
> + /* The pre-request flush has finished */
> + schedule_work(&mddev->flush_work);

Once we only handle empty barriers here we can directly call bio_endio
and the super wakeup instead of first scheduling a work queue.

> while ((bio = bio_list_pop(writes))) {
> - if (unlikely(bio_empty_barrier(bio))) {
> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {

I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
useful macro for the bio based drivers.
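
A rough userspace sketch of the helper being discussed, for readers following along. The struct, the flag values, and every sketch_ name are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; the kernel's values differ. */
#define SKETCH_REQ_WRITE (1u << 0)
#define SKETCH_REQ_FLUSH (1u << 1)

/* Simplified stand-in for struct bio: only what the check needs. */
struct sketch_bio {
    unsigned int rw;   /* request flags */
    size_t size;       /* payload size in bytes, 0 for an empty bio */
};

/* True when the bio carries a payload. */
static bool sketch_bio_has_data(const struct sketch_bio *bio)
{
    return bio->size != 0;
}

/* True for a pure cache flush: flush flag set and no payload. */
static bool sketch_bio_empty_flush(const struct sketch_bio *bio)
{
    return (bio->rw & SKETCH_REQ_FLUSH) && !sketch_bio_has_data(bio);
}
```

A bio-based driver could then branch once on sketch_bio_empty_flush() instead of repeating the two-part test at each call site.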

> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
> */
> spin_lock_irqsave(&md->deferred_lock, flags);
> if (__noflush_suspending(md)) {
> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
> + if (!(io->bio->bi_rw & REQ_FLUSH))

I suspect we don't actually need to special case flushes here anymore.


> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
> io_error = io->error;
> bio = io->bio;
>
> - if (bio->bi_rw & REQ_HARDBARRIER) {
> + if (bio->bi_rw & REQ_FLUSH) {
> /*
> - * There can be just one barrier request so we use
> + * There can be just one flush request so we use
> * a per-device variable for error reporting.
> * Note that you can't touch the bio after end_io_acct
> */
> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> - md->barrier_error = io_error;
> + if (!md->flush_error)
> + md->flush_error = io_error;

And we certainly do not need any special casing here. See my patch.

> {
> int rw = rq_data_dir(clone);
> int run_queue = 1;
> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
> struct dm_rq_target_io *tio = clone->end_io_data;
> struct mapped_device *md = tio->md;
> struct request *rq = tio->orig;
>
> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {

We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
for the second half of this conditional.

> + if (!is_flush)
> + blk_end_request_all(rq, error);
> + else {
> if (unlikely(error))
> - store_barrier_error(md, error);
> + store_flush_error(md, error);
> run_queue = 0;
> - } else
> - blk_end_request_all(rq, error);
> + }

Flush requests can now be completed normally.

> @@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
>
> ci.map = dm_get_live_table(md);
> if (unlikely(!ci.map)) {
> - if (!(bio->bi_rw & REQ_HARDBARRIER))
> + if (!(bio->bi_rw & REQ_FLUSH))
> bio_io_error(bio);
> else
> - if (!md->barrier_error)
> - md->barrier_error = -EIO;
> + if (!md->flush_error)
> + md->flush_error = -EIO;

No need for the special error handling here, flush requests can now
be completed normally.

> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
> part_stat_unlock();
>
> /*
> - * If we're suspended or the thread is processing barriers
> + * If we're suspended or the thread is processing flushes
> * we have to queue this io for later.
> */
> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> + (bio->bi_rw & REQ_FLUSH)) {
> up_read(&md->io_lock);

AFAICS this is only needed for the old barrier code, no need for this
for pure flushes.

> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>
> static bool dm_rq_is_flush_request(struct request *rq)
> {
> - if (rq->cmd_flags & REQ_FLUSH)
> - return true;
> - else
> - return false;
> + return rq->cmd_flags & REQ_FLUSH;
> }

It's probably worth just killing this wrapper.


> void dm_dispatch_request(struct request *rq)
> @@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
> if (dm_rq_is_flush_request(rq)) {
> blk_rq_init(NULL, clone);
> clone->cmd_type = REQ_TYPE_FS;
> - clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> + clone->cmd_flags |= (REQ_FLUSH | WRITE);
> } else {
> r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> dm_rq_bio_constructor, tio);

My suspicion is that we can get rid of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.

I also suspect that there is absolutely no need for the barrier work
queue once we stop waiting for outstanding requests. But then again
the request based dm code still somewhat confuses me.

> +static void process_flush(struct mapped_device *md, struct bio *bio)
> {
> + md->flush_error = 0;
> +
> + /* handle REQ_FLUSH */
> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>
> - bio_init(&md->barrier_bio);
> - md->barrier_bio.bi_bdev = md->bdev;
> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> - __split_and_process_bio(md, &md->barrier_bio);
> + bio_init(&md->flush_bio);
> + md->flush_bio.bi_bdev = md->bdev;
> + md->flush_bio.bi_rw = WRITE_FLUSH;
> + __split_and_process_bio(md, &md->flush_bio);

There's no need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests. See my patch for how to do this differently. And yeah,
my version has been tested.
Christoph Hellwig [ Sat, 14 August 2010 12:36 ] [ ID #2046025 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello, Christoph.

On 08/14/2010 12:36 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
>> Do you want to change the whole thing in a single commit? That would
>> be a pretty big invasive patch touching multiple subsystems.
>
> We can just stop draining in the block layer in the first patch, then
> stop doing the stuff in md/dm/etc in the following and then do the
> final renaming patches. It would still be fewer patches than now, but
> keep things working through the whole transition, which would really
> help bisecting any problems.

I'm not really convinced that would help much. If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough? No matter what we do the conversion will be
a single step thing. If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch. The same goes for md/dm, the
best we can find out would be whether the conversion is correct or not
anyway.

I'm not against restructuring the patchset if it makes more sense but
it just feels like it would be a bit pointless effort (and one which
would require much tighter coordination among different trees) at this
point. Am I missing something?

>> + if (req->cmd_flags & REQ_FUA)
>> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
>
> I'd suggest not adding FUA support to virtio yet. Just using the flush
> feature gives you a fully working barrier implementation.
>
> Eventually we might want to add a flag in the block queue to send
> REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
> avoid separate pre- and post flushes, but I really want to benchmark if
> it makes an impact on real life setups first.

I wrote this in the other mail but I think it would make a difference if
the backend storage is md/dm, especially if it's shared by multiple VMs.
It cuts down on one array-wide cache flush.
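
To make the saving concrete, here is a small userspace sketch of how a flush/FUA write might be decomposed depending on what the device advertises. The step names, flag values, and the decomposition policy itself are assumptions for illustration and are not taken from the patches.

```c
#include <stdbool.h>

/* Hypothetical flag bits, used for both request and device flags. */
#define SKETCH_REQ_FLUSH (1u << 0)
#define SKETCH_REQ_FUA   (1u << 1)

/* Steps a sequenced flush may be broken into. */
#define SKETCH_SEQ_PREFLUSH  (1u << 0)
#define SKETCH_SEQ_DATA      (1u << 1)
#define SKETCH_SEQ_POSTFLUSH (1u << 2)

/*
 * Decide which steps a request needs.
 * req_flags: flags set on the request.
 * dev_flags: capabilities the device advertises.
 */
static unsigned int sketch_flush_steps(unsigned int req_flags,
                                       unsigned int dev_flags,
                                       bool has_data)
{
    unsigned int steps = 0;

    if (has_data)
        steps |= SKETCH_SEQ_DATA;

    /* No volatile write cache advertised: nothing to flush. */
    if (!(dev_flags & SKETCH_REQ_FLUSH))
        return steps;

    if (req_flags & SKETCH_REQ_FLUSH)
        steps |= SKETCH_SEQ_PREFLUSH;

    /* Without native FUA, emulate it with a flush after the data. */
    if (has_data && (req_flags & SKETCH_REQ_FUA) &&
        !(dev_flags & SKETCH_REQ_FUA))
        steps |= SKETCH_SEQ_POSTFLUSH;

    return steps;
}

/* Number of cache flushes the chosen steps imply. */
static unsigned int sketch_count_flushes(unsigned int steps)
{
    return !!(steps & SKETCH_SEQ_PREFLUSH) + !!(steps & SKETCH_SEQ_POSTFLUSH);
}
```

Under these assumptions a FLUSH|FUA write costs two cache flushes on a device without FUA and one on a device with it, which is the flush being saved on each member of the array.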

>> Index: block/drivers/md/linear.c
>> ===================================================================
>> --- block.orig/drivers/md/linear.c
>> +++ block/drivers/md/linear.c
>> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>> dev_info_t *tmp_dev;
>> sector_t start_sector;
>>
>> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> - md_barrier_request(mddev, bio);
>> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
>> + md_flush_request(mddev, bio);
>
> We only need the special md_flush_request handling for
> empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> flag propagated to the underlying devices.

Hmm, not really, the WRITE should happen after all the data in cache
are committed to NV media, meaning that empty FLUSH should already
have finished by the time the WRITE starts.
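
The ordering requirement can be sketched in userspace as a completion counter, in the spirit of the flush_pending counter in the quoted md code: the write is released only when the last member flush completes. The struct and function names are hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Tracks one pre-flush that fans out to several member devices. */
struct sketch_flush_ctx {
    atomic_int pending;   /* member flushes still in flight */
    bool write_started;   /* set once the data write may be issued */
};

/* Arm the counter before any member flush is submitted. */
static void sketch_start_preflush(struct sketch_flush_ctx *ctx, int nr_members)
{
    ctx->write_started = false;
    atomic_store(&ctx->pending, nr_members);
}

/*
 * Completion callback for one member flush.
 * Returns true only for the completion that finishes the pre-flush.
 */
static bool sketch_end_member_flush(struct sketch_flush_ctx *ctx)
{
    /* atomic_fetch_sub returns the old value; 1 means we were last. */
    if (atomic_fetch_sub(&ctx->pending, 1) == 1) {
        ctx->write_started = true;   /* now safe to issue the WRITE */
        return true;
    }
    return false;
}
```

Simply propagating the flag to each member would skip this wait, so one member's write could start while another member's cache is still being flushed.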

>> +static void md_end_flush(struct bio *bio, int err)
>> {
>> mdk_rdev_t *rdev = bio->bi_private;
>> mddev_t *mddev = rdev->mddev;
>>
>> rdev_dec_pending(rdev, mddev);
>>
>> if (atomic_dec_and_test(&mddev->flush_pending)) {
>> + /* The pre-request flush has finished */
>> + schedule_work(&mddev->flush_work);
>
> Once we only handle empty barriers here we can directly call bio_endio
> and the super wakeup instead of first scheduling a work queue.

Yeap, right. That would be a nice optimization.

>> while ((bio = bio_list_pop(writes))) {
>> - if (unlikely(bio_empty_barrier(bio))) {
>> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
>
> I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> useful macro for the bio based drivers.

Hmm... maybe. The reason why I removed bio_empty_flush() was that
except for the front-most sequencer (block layer for all the request
based ones and the front-most make_request for bio based ones), it
doesn't make sense to see REQ_FLUSH + data bios. They should be
sequenced at the front-most stage anyway, so I didn't have much use
for them. Those code paths couldn't deal with REQ_FLUSH + data bios
anyway.

>> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>> */
>> spin_lock_irqsave(&md->deferred_lock, flags);
>> if (__noflush_suspending(md)) {
>> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
>> + if (!(io->bio->bi_rw & REQ_FLUSH))
>
> I suspect we don't actually need to special case flushes here anymore.

Oh, I'm not sure about this part at all. I'll ask Mike.

>> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>> io_error = io->error;
>> bio = io->bio;
>>
>> - if (bio->bi_rw & REQ_HARDBARRIER) {
>> + if (bio->bi_rw & REQ_FLUSH) {
>> /*
>> - * There can be just one barrier request so we use
>> + * There can be just one flush request so we use
>> * a per-device variable for error reporting.
>> * Note that you can't touch the bio after end_io_acct
>> */
>> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
>> - md->barrier_error = io_error;
>> + if (!md->flush_error)
>> + md->flush_error = io_error;
>
> And we certainly do not need any special casing here. See my patch.

I wasn't sure about that part. You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other
failures, no?

>> {
>> int rw = rq_data_dir(clone);
>> int run_queue = 1;
>> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
>> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
>> struct dm_rq_target_io *tio = clone->end_io_data;
>> struct mapped_device *md = tio->md;
>> struct request *rq = tio->orig;
>>
>> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
>> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
>
> We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
> for the second half of this conditional.

I see.

>> + if (!is_flush)
>> + blk_end_request_all(rq, error);
>> + else {
>> if (unlikely(error))
>> - store_barrier_error(md, error);
>> + store_flush_error(md, error);
>> run_queue = 0;
>> - } else
>> - blk_end_request_all(rq, error);
>> + }
>
> Flush requests can now be completed normally.

The same question as before. I think we still need to prioritize
DM_ENDIO_REQUEUE failures.

>> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>> part_stat_unlock();
>>
>> /*
>> - * If we're suspended or the thread is processing barriers
>> + * If we're suspended or the thread is processing flushes
>> * we have to queue this io for later.
>> */
>> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
>> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> + (bio->bi_rw & REQ_FLUSH)) {
>> up_read(&md->io_lock);
>
> AFAICS this is only needed for the old barrier code, no need for this
> for pure flushes.

I'll ask Mike.

>> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>>
>> static bool dm_rq_is_flush_request(struct request *rq)
>> {
>> - if (rq->cmd_flags & REQ_FLUSH)
>> - return true;
>> - else
>> - return false;
>> + return rq->cmd_flags & REQ_FLUSH;
>> }
>
> It's probably worth just killing this wrapper.

Yeah, probably. It was an accidental edit to begin with and I left
this part out in the new patch.

>> +static void process_flush(struct mapped_device *md, struct bio *bio)
>> {
>> + md->flush_error = 0;
>> +
>> + /* handle REQ_FLUSH */
>> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>
>> - bio_init(&md->barrier_bio);
>> - md->barrier_bio.bi_bdev = md->bdev;
>> - md->barrier_bio.bi_rw = WRITE_BARRIER;
>> - __split_and_process_bio(md, &md->barrier_bio);
>> + bio_init(&md->flush_bio);
>> + md->flush_bio.bi_bdev = md->bdev;
>> + md->flush_bio.bi_rw = WRITE_FLUSH;
>> + __split_and_process_bio(md, &md->flush_bio);
>
> There's no need to use a separate flush_bio here.
> __split_and_process_bio does the right thing for empty REQ_FLUSH
> requests. See my patch for how to do this differently. And yeah,
> my version has been tested.

But how do you make sure REQ_FLUSHes for preflush finish before
starting the write?

Thanks.

--
tejun
Tejun Heo [ Tue, 17 August 2010 11:59 ] [ ID #2046144 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much. If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough? No matter what we do the conversion will be
> a single step thing. If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch. The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.

The filesystems already enforce the ordering, except reiserfs which
opts out if the barrier option is set.

> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point. Am I missing something?

What other trees do you mean? The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it. Also the changes in the filesystem
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch to kill BH_Ordered,
and removing about 10 lines of code in reiserfs.

> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
>
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.

You're right.

> >> while ((bio = bio_list_pop(writes))) {
> >> - if (unlikely(bio_empty_barrier(bio))) {
> >> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> >
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
>
> Hmm... maybe. The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios. They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them. Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.

The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers. But I think it's a rather useful
helper for them. Either way, it's not a big issue and either way is
fine with me.

> >> + if (bio->bi_rw & REQ_FLUSH) {
> >> /*
> >> - * There can be just one barrier request so we use
> >> + * There can be just one flush request so we use
> >> * a per-device variable for error reporting.
> >> * Note that you can't touch the bio after end_io_acct
> >> */
> >> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> - md->barrier_error = io_error;
> >> + if (!md->flush_error)
> >> + md->flush_error = io_error;
> >
> > And we certainly do not need any special casing here. See my patch.
>
> I wasn't sure about that part. You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?

Which priority?

> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >> {
> >> + md->flush_error = 0;
> >> +
> >> + /* handle REQ_FLUSH */
> >> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> - bio_init(&md->barrier_bio);
> >> - md->barrier_bio.bi_bdev = md->bdev;
> >> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> - __split_and_process_bio(md, &md->barrier_bio);
> >> + bio_init(&md->flush_bio);
> >> + md->flush_bio.bi_bdev = md->bdev;
> >> + md->flush_bio.bi_rw = WRITE_FLUSH;
> >> + __split_and_process_bio(md, &md->flush_bio);
> >
> > There's no need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests. See my patch for how to do this differently. And yeah,
> > my version has been tested.
>
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?

Hmm, okay. I see how the special flush_bio makes the waiting easier,
let's see if Mike or other in the DM team have a better idea.

Christoph Hellwig [ Tue, 17 August 2010 15:19 ] [ ID #2046146 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi,

On 08/17/2010 03:19 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
>> I'm not against restructuring the patchset if it makes more sense but
>> it just feels like it would be a bit pointless effort (and one which
>> would require much tighter coordination among different trees) at this
>> point. Am I missing something?
>
> What other trees do you mean?

I was mostly thinking about dm/md, drbd and stuff, but you're talking
about filesystem conversion patches being routed through block tree,
right?

> The conversions of the 8 filesystems that actually support barriers
> need to go through this tree anyway if we want to be able to test
> it. Also the changes in the filesystem are absolutely minimal -
> it's basically just s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my
> initial patch to kill BH_Ordered, and removing about 10 lines of code in
> reiserfs.

I might just resequence it to finish this part of discussion but what
does that really buy us? It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work". If you show me how it would
actually help, I'll happily reshuffle the patches.

>> I wasn't sure about that part. You removed store_flush_error(), but
>> DM_ENDIO_REQUEUE should still have higher priority than other
>> failures, no?
>
> Which priority?

IIUC, when any of the flushes gets DM_ENDIO_REQUEUE (which tells the dm
core layer to retry the whole bio later), it trumps all other failures
and the bio is retried later. That was why DM_ENDIO_REQUEUE was
prioritized over other error codes, which actually is sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
layers as FLUSH failure implies data already lost. So,
DM_ENDIO_REQUEUE actually should have lower priority than other
failures. But, then again, the error codes still need to be
prioritized.
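
One way to express the prioritization argued for here (a hard failure beats a requeue, which beats success) is sketched below in userspace. The constant and the function name are hypothetical stand-ins, and the ordering is the one proposed in this mail, not what dm implemented at the time.

```c
#include <errno.h>

/* Hypothetical stand-in for DM_ENDIO_REQUEUE; hard errors are negative errnos. */
#define SKETCH_ENDIO_REQUEUE 2

/*
 * Merge the result of one completed flush into the stored per-device
 * result. The first hard failure is kept and is never overridden.
 */
static int sketch_merge_flush_error(int stored, int incoming)
{
    if (stored < 0)
        return stored;        /* already a hard failure: must be reported */
    if (incoming < 0)
        return incoming;      /* a hard failure trumps requeue and success */
    if (stored == SKETCH_ENDIO_REQUEUE || incoming == SKETCH_ENDIO_REQUEUE)
        return SKETCH_ENDIO_REQUEUE;
    return 0;
}
```

With this ordering a failed FLUSH is always reported upward, and a requeue is honoured only when no member reported data loss.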

>> But how do you make sure REQ_FLUSHes for preflush finish before
>> starting the write?
>
> Hmm, okay. I see how the special flush_bio makes the waiting easier,
> let's see if Mike or other in the DM team have a better idea.

Yeah, it would be better if it can be sequenced w/o using a work but
let's leave it for later.

Thanks.

--
tejun
Tejun Heo [ Tue, 17 August 2010 18:41 ] [ ID #2046154 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
> > What other trees do you mean?
>
> I was mostly thinking about dm/md, drbd and stuff, but you're talking
> about filesystem conversion patches being routed through block tree,
> right?

I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.

Btw, I've done the conversion for all filesystems and I'm running tests
over them now. Expect the series late today or tomorrow.

> I might just resequence it to finish this part of discussion but what
> does that really buy us? It's not really gonna help bisection.
> Bisection won't be able to tell anything in higher resolution than
> "the new implementation doesn't work". If you show me how it would
> actually help, I'll happily reshuffle the patches.

It's not bisecting to find bugs in the barrier conversion. We can't
easily bisect it down anyway. The problem is when we try to bisect
other problems and get into the middle of the series barriers suddenly
are gone. Which is not very helpful for things like data integrity
problems in filesystems.

> >> I wasn't sure about that part. You removed store_flush_error(), but
> >> DM_ENDIO_REQUEUE should still have higher priority than other
> >> failures, no?
> >
> > Which priority?
>
> IIUC, when any of the flushes gets DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost. So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures. But, then again, the error codes still need to be
> prioritized.

I think that's something we better leave to the DM team.

Christoph Hellwig [ Tue, 17 August 2010 18:59 ] [ ID #2046157 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/17/2010 06:59 PM, Christoph Hellwig wrote:
> I think we really need all the conversions in one tree, block layer,
> remapping drivers and filesystems.

I don't know. If filesystem changes are really trivial maybe, but
md/dm changes seem a bit too invasive to go through the block tree.

> Btw, I've done the conversion for all filesystems and I'm running tests
> over them now. Expect the series late today or tomorrow.

Cool. :-)

>> I might just resequence it to finish this part of discussion but what
>> does that really buy us? It's not really gonna help bisection.
>> Bisection won't be able to tell anything in higher resolution than
>> "the new implementation doesn't work". If you show me how it would
>> actually help, I'll happily reshuffle the patches.
>
> It's not bisecting to find bugs in the barrier conversion. We can't
> easily bisect it down anyway. The problem is when we try to bisect
> other problems and get into the middle of the series barriers suddenly
> are gone. Which is not very helpful for things like data integrity
> problems in filesystems.

Ah, okay, hmmm.... alright, I'll resequence the patches. If the
filesystem changes can be put into a single tree somehow, we can keep
things mostly working at least for direct devices.

>> IIUC, when any of the flushes gets DM_ENDIO_REQUEUE (which tells the dm
>> core layer to retry the whole bio later), it trumps all other failures
>> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
>> prioritized over other error codes, which actually is sort of
>> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
>> layers as FLUSH failure implies data already lost. So,
>> DM_ENDIO_REQUEUE actually should have lower priority than other
>> failures. But, then again, the error codes still need to be
>> prioritized.
>
> I think that's something we better leave to the DM team.

Sure, but we shouldn't be ripping out the code to do that.

Thanks.

--
tejun
Tejun Heo [ Wed, 18 August 2010 08:35 ] [ ID #2046210 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 08:35 AM, Tejun Heo wrote:
>> It's not bisecting to find bugs in the barrier conversion. We can't
>> easily bisect it down anyway. The problem is when we try to bisect
>> other problems and get into the middle of the series barriers suddenly
>> are gone. Which is not very helpful for things like data integrity
>> problems in filesystems.
>
> Ah, okay, hmmm.... alright, I'll resequence the patches. If the
> filesystem changes can be put into a single tree somehow, we can keep
> things mostly working at least for direct devices.

Sorry, but I'm not doing it. It just doesn't make much sense. I can't
relax the ordering for REQ_HARDBARRIER without breaking the remapping
drivers. So, to keep things working, I'll have to 1. relax the
ordering 2. implement new REQ_FLUSH/FUA based interface and 3. use
them in the filesystems in the same patch. That's just wrong. And I
don't think md/dm changes can or should go through the block tree.
They're way too invasive for that. It's a new implementation and
barrier won't work (fail gracefully) for several commits during the
transition. I don't think there's a better way around it.

Thanks.

--
tejun
Tejun Heo [ Wed, 18 August 2010 10:11 ] [ ID #2046213 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

FYI: One issue with this series is that make_request based drivers
now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
need to add handling of empty REQ_FLUSH requests to all of them or
figure out a way to prevent them getting sent. That is assuming they'll
simply ignore REQ_FLUSH/REQ_FUA on normal writes.
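
One possible shape for the second option (preventing them getting sent) is a filter in front of the driver. This is a userspace sketch only: the struct, flag values, and function name are assumptions, and it is not a description of what the series does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch. */
#define SKETCH_REQ_FLUSH (1u << 0)
#define SKETCH_REQ_FUA   (1u << 1)

/* Simplified stand-in for struct bio. */
struct sketch_bio {
    unsigned int rw;   /* request flags */
    size_t size;       /* payload size in bytes, 0 for an empty bio */
};

/*
 * Filter applied before a bio reaches a make_request based driver.
 * queue_flush_flags: what the queue advertises (0 = nothing to flush).
 * Returns true if the bio should still be handed to the driver,
 * false if it can be completed successfully right away.
 */
static bool sketch_filter_flush(struct sketch_bio *bio,
                                unsigned int queue_flush_flags)
{
    const unsigned int mask = SKETCH_REQ_FLUSH | SKETCH_REQ_FUA;

    if ((bio->rw & mask) && queue_flush_flags == 0) {
        /* The driver never sees flags it does not understand. */
        bio->rw &= ~mask;
        /* An empty flush has nothing left to do. */
        if (bio->size == 0)
            return false;
    }
    return true;
}
```

Drivers that advertise no flush capability would then only ever see plain reads and writes, at the cost of one extra check on the submission path.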

Christoph Hellwig [ Wed, 18 August 2010 11:46 ] [ ID #2046214 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Christoph Hellwig, on 08/13/2010 05:17 PM wrote:
> As far as playing with ordered tags it's just adding a new flag for
> it on the bio that gets passed down to the driver. For a final version
> you'd need a queue-level feature if it's supported, but you don't
> even need that for the initial work. Then you can implement a
> variant of blk_do_flush that does away with queueing additional requests
> once finish but queues all two or three at the same time with your
> new ordered flag set, at which point you are back to the level or
> ordered tag usage that the old code allows. You're still left with
> all the hard problems of actually implementing error handling for it
> and using it higher up in the filesystem and generic page cache code.

But how about file systems doing internal local order-by-drain? Without
converting them to use ordered commands it would be impossible to show
their full potential, and to make the conversion one would need deep
internal FS knowledge. That's my point. But if there's a trivial way to
see all such places in the filesystem code and convert them, then OK, I agree.

> I'd really love to see your results, up to the point of just trying
> that once I get a little spare time. But my theory is that it won't
> help us - the problem with ordered tags is that they enforce global
> ordering while we currently have local ordering. While it will reduce
> the latency for the process waiting for an fsync or similar it will
> affect other I/O going on in the background and reduce the devices
> ability to reorder that I/O.

Local vs. global ordering is relevant only if the load comes from
several applications/threads. But what about a single
application/thread?

Another point, and AFAIU the reason the ORDERED commands were invented, is
that they enforce ordering on the _other_ side of the link, _after_ all
link/transfer latencies. This is why it's hard to see their advantage
on local disks.

Vlad
Vladislav Bolkhovitin [ Wed, 18 August 2010 21:29 ] [ ID #2046221 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

Tejun Heo, on 08/13/2010 05:21 PM wrote:
>> If requested, I can develop the interface further.
>
> I still think the benefit of ordering by tag would be marginal at
> best, and what have you guys measured there? Under the current
> framework, there's no easy way to measure full ordered-by-tag
> implementation. The mechanism for filesystems to communicate the
> ordering information (which would be a partially ordered graph) just
> isn't there and there is no way the current usage of ordering-by-tag
> only for barrier sequence can achieve anything close to that level of
> difference.

Basically, I measured how iSCSI link utilization depends on the number of
queued commands and the queued data size. This is why I presented it as a table.
From it you can see what improvement you would get by removing queue
draining after 1, 2, 4, etc. commands, depending on the command sizes.

For instance, in my previous XFS rm example, where rm of 4 files took
3.5 minutes with the nobarrier option, I could see that XFS was sending 1-3
32K commands in a row. From my table you can see that if it sent all of
them at once without draining, it would get about a 150-200% speed increase.

Vlad
Vladislav Bolkhovitin [ Wed, 18 August 2010 21:30 ] [ ID #2046222 ]

Re: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

On 08/14/2010 02:42 AM, hch [at] lst.de wrote:
> On Fri, Aug 13, 2010 at 06:07:13PM -0700, Jeremy Fitzhardinge wrote:
>> On 08/12/2010 05:41 AM, Tejun Heo wrote:
>>> Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
>>> requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
>>> -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
>>> blk_queue_flush().
>>>
>>> blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
>>> device has write cache and can flush it, it should set REQ_FLUSH. If
>>> the device can handle FUA writes, it should also set REQ_FUA.
>> Christoph, do these two patches (parts 2 and 3) make xen-blkfront
>> correct WRT barriers/flushing as far as you're concerned?
> If all your backends handle a zero-length BLKIF_OP_WRITE_BARRIER request
> it is a fully correct, but rather suboptimal implementation. To get
> all the benefit of the new non-draining barriers you'll need a new
> BLKIF_OP_FLUSH request that only flushes the cache, but has no
> ordering side effects.

Is the effect of the flush that, once complete, any previously completed
write is guaranteed to be on durable storage, but it is not guaranteed
to have any effect on pending writes? If so, does it flush writes that
were completed before the flush is issued, or writes that complete
before the flush completes?

> Note that
> the quite suboptimal here means not as good as the new barrier
> implementation, but it shouldn't be noticeably worse than the old one
> for Xen.

OK, thanks. We can do some testing on that and see if there's a benefit
to adding a flush operation with the appropriate semantics.

J
Jeremy Fitzhardinge [ Mon, 16 August 2010 22:38 ] [ ID #2046278 ]

Re: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

On 08/12/2010 05:41 AM, Tejun Heo wrote:
> Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
> requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
> -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
> blk_queue_flush().
>
> blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
> device has write cache and can flush it, it should set REQ_FLUSH. If
> the device can handle FUA writes, it should also set REQ_FUA.

Christoph, do these two patches (parts 2 and 3) make xen-blkfront
correct WRT barriers/flushing as far as you're concerned?

Thanks,
J

> All blk_queue_ordered() users are converted.
>
> * ORDERED_DRAIN is mapped to 0 which is the default value.
> * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
> * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
>
> Signed-off-by: Tejun Heo <tj [at] kernel.org>
> Cc: Christoph Hellwig <hch [at] infradead.org>
> Cc: Nick Piggin <npiggin [at] kernel.dk>
> Cc: Michael S. Tsirkin <mst [at] redhat.com>
> Cc: Jeremy Fitzhardinge <jeremy [at] xensource.com>
> Cc: Chris Wright <chrisw [at] sous-sol.org>
> Cc: FUJITA Tomonori <fujita.tomonori [at] lab.ntt.co.jp>
> Cc: Boaz Harrosh <bharrosh [at] panasas.com>
> Cc: Geert Uytterhoeven <Geert.Uytterhoeven [at] sonycom.com>
> Cc: David S. Miller <davem [at] davemloft.net>
> Cc: Alasdair G Kergon <agk [at] redhat.com>
> Cc: Pierre Ossman <drzeus [at] drzeus.cx>
> Cc: Stefan Weinhuber <wein [at] de.ibm.com>
> ---
> block/blk-barrier.c | 29 ----------------------------
> block/blk-core.c | 6 +++-
> block/blk-settings.c | 20 +++++++++++++++++++
> drivers/block/brd.c | 1 -
> drivers/block/loop.c | 2 +-
> drivers/block/osdblk.c | 2 +-
> drivers/block/ps3disk.c | 2 +-
> drivers/block/virtio_blk.c | 25 ++++++++---------------
> drivers/block/xen-blkfront.c | 43 +++++++++++------------------------------
> drivers/ide/ide-disk.c | 13 +++++------
> drivers/md/dm.c | 2 +-
> drivers/mmc/card/queue.c | 1 -
> drivers/s390/block/dasd.c | 1 -
> drivers/scsi/sd.c | 16 +++++++-------
> include/linux/blkdev.h | 6 +++-
> 15 files changed, 67 insertions(+), 102 deletions(-)
>
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index c807e9c..ed0aba5 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -9,35 +9,6 @@
>
> #include "blk.h"
>
> -/**
> - * blk_queue_ordered - does this queue support ordered writes
> - * @q: the request queue
> - * @ordered: one of QUEUE_ORDERED_*
> - *
> - * Description:
> - * For journalled file systems, doing ordered writes on a commit
> - * block instead of explicitly doing wait_on_buffer (which is bad
> - * for performance) can be a big win. Block drivers supporting this
> - * feature should call this function and indicate so.
> - *
> - **/
> -int blk_queue_ordered(struct request_queue *q, unsigned ordered)
> -{
> - if (ordered != QUEUE_ORDERED_NONE &&
> - ordered != QUEUE_ORDERED_DRAIN &&
> - ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
> - ordered != QUEUE_ORDERED_DRAIN_FUA) {
> - printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
> - return -EINVAL;
> - }
> -
> - q->ordered = ordered;
> - q->next_ordered = ordered;
> -
> - return 0;
> -}
> -EXPORT_SYMBOL(blk_queue_ordered);
> -
> /*
> * Cache flushing for ordered writes handling
> */
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5ab3ac2..3f802dd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
> const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
> int rw_flags;
>
> - if ((bio->bi_rw & REQ_HARDBARRIER) &&
> - (q->next_ordered == QUEUE_ORDERED_NONE)) {
> + /* REQ_HARDBARRIER is no more */
> + if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
> + "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
> bio_endio(bio, -EOPNOTSUPP);
> return 0;
> }
> +
> /*
> * low level driver can indicate that it wants pages above a
> * certain limit bounced to low memory (ie for highmem, or even
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a234f4b..9b18afc 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
> }
> EXPORT_SYMBOL(blk_queue_update_dma_alignment);
>
> +/**
> + * blk_queue_flush - configure queue's cache flush capability
> + * @q: the request queue for the device
> + * @flush: 0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
> + *
> + * Tell block layer cache flush capability of @q. If it supports
> + * flushing, REQ_FLUSH should be set. If it supports bypassing
> + * write cache for individual writes, REQ_FUA should be set.
> + */
> +void blk_queue_flush(struct request_queue *q, unsigned int flush)
> +{
> + WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
> +
> + if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
> + flush &= ~REQ_FUA;
> +
> + q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_flush);
> +
> static int __init blk_settings_init(void)
> {
> blk_max_low_pfn = max_low_pfn - 1;
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 47a4127..fa33f97 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
> if (!brd->brd_queue)
> goto out_free_dev;
> blk_queue_make_request(brd->brd_queue, brd_make_request);
> - blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
> blk_queue_max_hw_sectors(brd->brd_queue, 1024);
> blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c3a4a2e..953d1e1 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> lo->lo_queue->unplug_fn = loop_unplug;
>
> if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
> - blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
> + blk_queue_flush(lo->lo_queue, REQ_FLUSH);
>
> set_capacity(lo->lo_disk, size);
> bd_set_size(bdev, size << 9);
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index 2284b4f..72d6246 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
> blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
>
> blk_queue_prep_rq(q, blk_queue_start_tag);
> - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
> + blk_queue_flush(q, REQ_FLUSH);
>
> disk->queue = q;
>
> diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
> index e9da874..4911f9e 100644
> --- a/drivers/block/ps3disk.c
> +++ b/drivers/block/ps3disk.c
> @@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
> blk_queue_dma_alignment(queue, dev->blk_size-1);
> blk_queue_logical_block_size(queue, dev->blk_size);
>
> - blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
> + blk_queue_flush(queue, REQ_FLUSH);
>
> blk_queue_max_segments(queue, -1);
> blk_queue_max_segment_size(queue, dev->bounce_size);
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 7965280..d10b635 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> vblk->disk->driverfs_dev = &vdev->dev;
> index++;
>
> - if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
> - /*
> - * If the FLUSH feature is supported we do have support for
> - * flushing a volatile write cache on the host. Use that
> - * to implement write barrier support.
> - */
> - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
> - } else {
> - /*
> - * If the FLUSH feature is not supported we must assume that
> - * the host does not perform any kind of volatile write
> - * caching. We still need to drain the queue to provider
> - * proper barrier semantics.
> - */
> - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
> - }
> + /*
> + * If the FLUSH feature is supported we do have support for
> + * flushing a volatile write cache on the host. Use that to
> + * implement write barrier support; otherwise, we must assume
> + * that the host does not perform any kind of volatile write
> + * caching.
> + */
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
> + blk_queue_flush(q, REQ_FLUSH);
>
> /* If disk is read-only in the host, the guest should obey */
> if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 25ffbf9..1d48f3a 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -95,7 +95,7 @@ struct blkfront_info
> struct gnttab_free_callback callback;
> struct blk_shadow shadow[BLK_RING_SIZE];
> unsigned long shadow_free;
> - int feature_barrier;
> + unsigned int feature_flush;
> int is_ready;
> };
>
> @@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> }
>
>
> -static int xlvbd_barrier(struct blkfront_info *info)
> +static void xlvbd_flush(struct blkfront_info *info)
> {
> - int err;
> - const char *barrier;
> -
> - switch (info->feature_barrier) {
> - case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
> - case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
> - default: return -EINVAL;
> - }
> -
> - err = blk_queue_ordered(info->rq, info->feature_barrier);
> -
> - if (err)
> - return err;
> -
> + blk_queue_flush(info->rq, info->feature_flush);
> printk(KERN_INFO "blkfront: %s: barriers %s\n",
> - info->gd->disk_name, barrier);
> - return 0;
> + info->gd->disk_name,
> + info->feature_flush ? "enabled" : "disabled");
> }
>
>
> @@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
> info->rq = gd->queue;
> info->gd = gd;
>
> - xlvbd_barrier(info);
> + xlvbd_flush(info);
>
> if (vdisk_info & VDISK_READONLY)
> set_disk_ro(gd, 1);
> @@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
> info->gd->disk_name);
> error = -EOPNOTSUPP;
> - info->feature_barrier = QUEUE_ORDERED_NONE;
> - xlvbd_barrier(info);
> + info->feature_flush = 0;
> + xlvbd_flush(info);
> }
> /* fall through */
> case BLKIF_OP_READ:
> @@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
> /*
> * If there's no "feature-barrier" defined, then it means
> * we're dealing with a very old backend which writes
> - * synchronously; draining will do what needs to get done.
> + * synchronously; nothing to do.
> *
> * If there are barriers, then we use flush.
> - *
> - * If barriers are not supported, then there's no much we can
> - * do, so just set ordering to NONE.
> */
> - if (err)
> - info->feature_barrier = QUEUE_ORDERED_DRAIN;
> - else if (barrier)
> - info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
> - else
> - info->feature_barrier = QUEUE_ORDERED_NONE;
> + info->feature_flush = 0;
> + if (!err && barrier)
> + info->feature_flush = REQ_FLUSH;
>
> err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
> if (err) {
> diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
> index 7433e07..7c5b01c 100644
> --- a/drivers/ide/ide-disk.c
> +++ b/drivers/ide/ide-disk.c
> @@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
> return ide_no_data_taskfile(drive, &cmd);
> }
>
> -static void update_ordered(ide_drive_t *drive)
> +static void update_flush(ide_drive_t *drive)
> {
> u16 *id = drive->id;
> - unsigned ordered = QUEUE_ORDERED_NONE;
> + unsigned flush = 0;
>
> if (drive->dev_flags & IDE_DFLAG_WCACHE) {
> unsigned long long capacity;
> @@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
> drive->name, barrier ? "" : "not ");
>
> if (barrier) {
> - ordered = QUEUE_ORDERED_DRAIN_FLUSH;
> + flush = REQ_FLUSH;
> blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
> }
> - } else
> - ordered = QUEUE_ORDERED_DRAIN;
> + }
>
> - blk_queue_ordered(drive->queue, ordered);
> + blk_queue_flush(drive->queue, flush);
> }
>
> ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
> @@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
> }
> }
>
> - update_ordered(drive);
> + update_flush(drive);
>
> return err;
> }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index a3f21dc..b71cc9e 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(int minor)
> blk_queue_softirq_done(md->queue, dm_softirq_done);
> blk_queue_prep_rq(md->queue, dm_prep_fn);
> blk_queue_lld_busy(md->queue, dm_lld_busy);
> - blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
> + blk_queue_flush(md->queue, REQ_FLUSH);
>
> md->disk = alloc_disk(1);
> if (!md->disk)
> diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
> index c77eb49..d791772 100644
> --- a/drivers/mmc/card/queue.c
> +++ b/drivers/mmc/card/queue.c
> @@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
> mq->req = NULL;
>
> blk_queue_prep_rq(mq->queue, mmc_prep_request);
> - blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
> queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
>
> #ifdef CONFIG_MMC_BLOCK_BOUNCE
> diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
> index 1a84fae..29046b7 100644
> --- a/drivers/s390/block/dasd.c
> +++ b/drivers/s390/block/dasd.c
> @@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
> */
> blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
> blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
> - blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
> }
>
> /*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 05a15b0..7f6aca2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
> struct scsi_disk *sdkp = scsi_disk(disk);
> struct scsi_device *sdp = sdkp->device;
> unsigned char *buffer;
> - unsigned ordered;
> + unsigned flush = 0;
>
> SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
> "sd_revalidate_disk\n"));
> @@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)
>
> /*
> * We now have all cache related info, determine how we deal
> - * with ordered requests.
> + * with flush requests.
> */
> - if (sdkp->WCE)
> - ordered = sdkp->DPOFUA
> - ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
> - else
> - ordered = QUEUE_ORDERED_DRAIN;
> + if (sdkp->WCE) {
> + flush |= REQ_FLUSH;
> + if (sdkp->DPOFUA)
> + flush |= REQ_FUA;
> + }
>
> - blk_queue_ordered(sdkp->disk->queue, ordered);
> + blk_queue_flush(sdkp->disk->queue, flush);
>
> set_capacity(disk, sdkp->capacity);
> kfree(buffer);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 96ef5f1..6003f7c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -355,8 +355,10 @@ struct request_queue
> struct blk_trace *blk_trace;
> #endif
> /*
> - * reserved for flush operations
> + * for flush operations
> */
> + unsigned int flush_flags;
> +
> unsigned int ordered, next_ordered, ordseq;
> int orderr, ordcolor;
> struct request pre_flush_rq, bar_rq, post_flush_rq;
> @@ -863,8 +865,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
> extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
> extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
> extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
> +extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
> extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
> -extern int blk_queue_ordered(struct request_queue *, unsigned);
> extern bool blk_do_ordered(struct request_queue *, struct request **);
> extern unsigned blk_ordered_cur_seq(struct request_queue *);
> extern unsigned blk_ordered_req_seq(struct request *);
> --
> 1.7.1
>
Jeremy Fitzhardinge [ Sat, 14 August 2010 03:07 ] [ ID #2046279 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
> Basically, I measured how iSCSI link utilization depends on the number
> of queued commands and the queued data size. This is why I presented it
> as a table. From it you can see what improvement you would get by
> removing queue draining after 1, 2, 4, etc. commands, depending on the
> command sizes.
>
> For instance, in my previous XFS rm example, where rm of 4 files
> took 3.5 minutes with the nobarrier option, I could see that XFS was
> sending 1-3 32K commands in a row. From my table you can see that if
> it sent all of them at once without draining, it would get about a
> 150-200% speed increase.

You compared barrier off/on. Of course, it will make a big
difference. I think a good part of that gain should be realized by the
currently proposed patchset which removes draining. What needs to
be demonstrated is the difference between ordered-by-waiting and
ordered-by-tag. We've never had code to do that properly.

The original ordered-by-tag we had only applied tag ordering to two or
three command sequences inside a barrier, which doesn't amount to much
(and could even be harmful as it imposes draining of all simple
commands inside the device only to reduce issue latencies for a few
commands). You'll need to hook into the filesystem and somehow export the
ordering information down to the driver so that whatever needs
ordering is sent out as ordered commands.

As I've written multiple times, I'm pretty skeptical it will bring much.
Ordered tag mandates draining inside the device just like the original
barrier implementation. Sure, it's done at a lower layer and command
issue latencies will be reduced thanks to that but ordered-by-waiting
doesn't require _any_ draining at all. The whole pipeline can be kept
full all the time. I'm often wrong tho, so please feel free to go
ahead and prove me wrong. :-)

Thanks.

--
tejun
Tejun Heo [ Thu, 19 August 2010 11:51 ] [ ID #2046283 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> FYI: One issue with this series is that make_request based drivers
> now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
> need to add handling of empty REQ_FLUSH requests to all of them or
> figure out a way to prevent them from getting sent. That is assuming they'll
> simply ignore REQ_FLUSH/REQ_FUA on normal writes.

Can you be a bit more specific? In most cases, request based drivers
should be fine. They sit behind the frontmost request_queue which
would decompose REQ_FLUSH/FUAs into the appropriate command sequence.
For the request based drivers, it's no different from the original
REQ_HARDBARRIER mechanism; they'll just see flushes and optionally FUA
writes.

Thanks.

--
tejun
Tejun Heo [ Thu, 19 August 2010 11:57 ] [ ID #2046284 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Thu, Aug 19, 2010 at 11:57:53AM +0200, Tejun Heo wrote:
> On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> > FYI: One issue with this series is that make_request based drivers
> > now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
> > need to add handling of empty REQ_FLUSH requests to all of them or
> > figure out a way to prevent them from getting sent. That is assuming they'll
> > simply ignore REQ_FLUSH/REQ_FUA on normal writes.
>
> Can you be a bit more specific? In most cases, request based drivers
> should be fine. They sit behind the frontmost request_queue which
> would decompose REQ_FLUSH/FUAs into the appropriate command sequence.

I said make_request based drivers, that is drivers taking bios. These
get bios directly from __generic_make_request and need to deal with
REQ_FLUSH/FUA themselves. We have quite a few more than just dm/md of
this kind:

arch/powerpc/sysdev/axonram.c: blk_queue_make_request(bank->disk->queue, axon_ram_make_request);
drivers/block/aoe/aoeblk.c: blk_queue_make_request(d->blkq, aoeblk_make_request);
drivers/block/brd.c: blk_queue_make_request(brd->brd_queue, brd_make_request);
drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26);
drivers/block/loop.c: blk_queue_make_request(lo->lo_queue, loop_make_request);
drivers/block/pktcdvd.c: blk_queue_make_request(q, pkt_make_request);
drivers/block/ps3vram.c: blk_queue_make_request(queue, ps3vram_make_request);
drivers/block/umem.c: blk_queue_make_request(card->queue, mm_make_request);
drivers/s390/block/dcssblk.c: blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
drivers/s390/block/xpram.c: blk_queue_make_request(xpram_queues[i], xpram_make_request);
drivers/staging/zram/zram_drv.c: blk_queue_make_request(zram->queue, zram_make_request);


Christoph Hellwig [ Thu, 19 August 2010 12:20 ] [ ID #2046285 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/19/2010 12:20 PM, Christoph Hellwig wrote:
> I said make_request based drivers, that is drivers taking bios.

Right. Gees, it's confusing.

> These get bios directly from __generic_make_request and need to deal
> with REQ_FLUSH/FUA themselves. We have quite a few more than just
> dm/md of this kind:
>
> arch/powerpc/sysdev/axonram.c
> drivers/block/aoe/aoeblk.c
> drivers/block/brd.c

I'll try to convert these three.

> drivers/block/drbd/drbd_main.c

I'd rather leave drbd to its maintainers.

> drivers/block/loop.c

Already converted.

> drivers/block/pktcdvd.c
> drivers/block/ps3vram.c
> drivers/block/umem.c
> drivers/s390/block/dcssblk.c
> drivers/s390/block/xpram.c
> drivers/staging/zram/zram_drv.c

Will work on these.

Thanks.

--
tejun
Tejun Heo [ Thu, 19 August 2010 12:22 ] [ ID #2046286 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Tejun, Christoph,

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
>>> I wasn't sure about that part. You removed store_flush_error(), but
>>> DM_ENDIO_REQUEUE should still have higher priority than other
>>> failures, no?
>>
>> Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost. So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures. But, then again, the error codes still need to be
> prioritized.

I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that FLUSH failure implies data loss possibility.)
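
The priority rule discussed above can be sketched as a small userspace
model. This is only an illustration under stated assumptions, not dm code:
MODEL_ENDIO_REQUEUE and model_combine_flush_result() are hypothetical names,
and the numeric value of the requeue code is made up. The idea is that a
hard flush failure must never be hidden by a requeue request from another
clone of the same flush.

```c
#include <errno.h>

/* Hypothetical stand-in for a "please retry" status; the value is illustrative. */
#define MODEL_ENDIO_REQUEUE 1

/*
 * Fold the status of one completed flush clone into the running status
 * for the whole bio.  0 means success, a negative errno is a hard
 * failure, MODEL_ENDIO_REQUEUE asks for a retry.
 *
 * Priority, lowest to highest: success < requeue < hard error.
 */
int model_combine_flush_result(int current, int incoming)
{
	if (current < 0)
		return current;		/* a hard failure is sticky */
	if (incoming < 0)
		return incoming;	/* hard failure trumps success and requeue */
	if (incoming == MODEL_ENDIO_REQUEUE)
		return MODEL_ENDIO_REQUEUE;
	return current;			/* incoming == 0 keeps what we had */
}
```

With this ordering, a requeue followed by -EIO (or -EIO followed by a
requeue) both end up as -EIO, so the flush failure always reaches the
upper layer.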

But the patch is not enough; you have to change target drivers, too.
E.g. for multipath, you need to change
drivers/md/dm-mpath.c:do_end_io() to return an error for REQ_FLUSH,
like the REQ_DISCARD support included in 2.6.36-rc1.


By the way, if this patchset with the change above is included,
even a single path failure for REQ_FLUSH on a multipath configuration will
be reported to the upper layer as an error, although currently it is
retried using other paths.
Then, if an upper layer doesn't take correct recovery action for the error,
users would see it as a regression (e.g. frequent EXT3 errors
resulting in read-only mounts on multipath configurations).

Although I think an explicit error is better than implicit data
corruption, please check upper layers carefully so that users see
such errors as rarely as possible.

Thanks,
Kiyoshi Ueda
Kiyoshi Ueda [ Fri, 20 August 2010 10:26 ] [ ID #2046355 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

FYI: here's a little writeup to document the new cache flushing scheme,
intended to replace Documentation/block/barriers.txt. Any good
suggestion for a filename in the kernel tree?

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the physical medium. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the physical medium when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that write requests that have completed before the bio was submitted
actually are on the physical medium before this request has started.
In addition the REQ_FLUSH flag can be set on an otherwise empty bio
structure, which causes only an explicit cache flush without any dependent
I/O. It is recommended to use the blkdev_issue_flush() helper for a pure
cache flush.


Forced Unit Access
------------------

The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is not
signaled before the data has made it to non-volatile storage on the
physical medium.
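
To make the two guarantees concrete, here is a minimal userspace model of
a device with a volatile write-back cache. It is a sketch for illustration
only: struct model_disk, the MODEL_REQ_* values and all function names are
hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NBLOCKS 8

/* Hypothetical flag values for this model only; not the kernel's REQ_* bits. */
#define MODEL_REQ_FLUSH (1u << 0)
#define MODEL_REQ_FUA   (1u << 1)

struct model_disk {
	int  medium[MODEL_NBLOCKS];	/* what survives a power cut */
	int  cache[MODEL_NBLOCKS];	/* volatile write-back cache */
	bool dirty[MODEL_NBLOCKS];	/* cache[i] is newer than medium[i] */
};

/* Write every dirty cache entry to the medium. */
void model_flush_cache(struct model_disk *d)
{
	for (size_t i = 0; i < MODEL_NBLOCKS; i++) {
		if (d->dirty[i]) {
			d->medium[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

/*
 * Submit one write.  has_data == false models an empty bio, which only
 * makes sense together with MODEL_REQ_FLUSH.
 */
void model_submit_write(struct model_disk *d, unsigned int flags,
			bool has_data, size_t block, int value)
{
	if (flags & MODEL_REQ_FLUSH)
		model_flush_cache(d);	/* earlier completed writes go out first */

	if (!has_data)
		return;

	assert(block < MODEL_NBLOCKS);
	d->cache[block] = value;
	if (flags & MODEL_REQ_FUA) {
		d->medium[block] = value;	/* on the medium before completion */
		d->dirty[block] = false;
	} else {
		d->dirty[block] = true;		/* completion is signalled early */
	}
}

/* Power loss: whatever only lived in the volatile cache is gone. */
void model_power_cut(struct model_disk *d)
{
	for (size_t i = 0; i < MODEL_NBLOCKS; i++) {
		d->cache[i] = d->medium[i];
		d->dirty[i] = false;
	}
}
```

In this model a plain write is lost by model_power_cut() unless a later
MODEL_REQ_FLUSH request pushed it out, while a MODEL_REQ_FUA write survives
on its own. Note that MODEL_REQ_FLUSH says nothing about the write it is
attached to; that write is still only cached.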


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry about whether the underlying devices need any explicit cache flushing or
how the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
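
One way a journalling filesystem might combine the two flags is sketched
below as a userspace model that only plans the sequence of steps. This is
an assumption about usage, not code from any filesystem: the step types,
MODEL_REQ_* values and model_plan_commit() are hypothetical. Because
REQ_FLUSH only covers writes that have already completed, the plan waits
for the log blocks before submitting the commit block.

```c
#include <stddef.h>

/* Hypothetical flag values for this model only; not the kernel's REQ_* bits. */
#define MODEL_REQ_FLUSH (1u << 0)
#define MODEL_REQ_FUA   (1u << 1)

enum model_step_kind { MODEL_SUBMIT, MODEL_WAIT_ALL };

struct model_step {
	enum model_step_kind kind;
	unsigned int flags;	/* only meaningful for MODEL_SUBMIT */
	int block;		/* only meaningful for MODEL_SUBMIT, else -1 */
};

/*
 * Plan a journal commit with ordering done by waiting in the filesystem:
 *   1. submit all log blocks as plain writes,
 *   2. wait for their completion,
 *   3. submit the commit block with FLUSH (log blocks reach the medium
 *      first) and FUA (the commit block itself is durable on completion),
 *   4. wait for the commit block.
 *
 * Returns the number of steps written, or 0 if 'max' is too small.
 */
size_t model_plan_commit(const int *log_blocks, size_t nr_log,
			 int commit_block,
			 struct model_step *out, size_t max)
{
	size_t n = 0;

	if (max < nr_log + 3)
		return 0;

	for (size_t i = 0; i < nr_log; i++)
		out[n++] = (struct model_step){ MODEL_SUBMIT, 0, log_blocks[i] };

	out[n++] = (struct model_step){ MODEL_WAIT_ALL, 0, -1 };
	out[n++] = (struct model_step){ MODEL_SUBMIT,
					MODEL_REQ_FLUSH | MODEL_REQ_FUA,
					commit_block };
	out[n++] = (struct model_step){ MODEL_WAIT_ALL, 0, -1 };
	return n;
}
```

Nothing in this plan drains the queue for unrelated I/O; only the
filesystem's own log writes are waited for.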


Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bit needs to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
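
The remapping case can be sketched with a toy two-leg striping driver. This
is a userspace model under stated assumptions, not the dm or md code:
struct sim_bio, struct sim_leg, struct sim_stripe and sim_stripe_submit()
are hypothetical names, and the legs only count what they are asked to do.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are made up for this model; they are not the kernel's. */
#define SIM_REQ_FLUSH (1u << 0)
#define SIM_REQ_FUA   (1u << 1)
#define SIM_NLEGS     2

/* One underlying device; it only records what it was asked to do. */
struct sim_leg {
	unsigned flushes;	/* cache flushes received */
	unsigned writes;	/* data writes received */
	unsigned fua_writes;	/* data writes that carried FUA */
};

/* Minimal stand-in for a bio; len == 0 means "no payload". */
struct sim_bio {
	long     sector;
	size_t   len;
	unsigned flags;
};

struct sim_stripe {
	struct sim_leg legs[SIM_NLEGS];
	long chunk;		/* sectors per stripe chunk */
};

void sim_stripe_submit(struct sim_stripe *s, const struct sim_bio *bio)
{
	struct sim_leg *leg;

	assert(s->chunk > 0 && bio->sector >= 0);

	/*
	 * Global flush: earlier writes may sit in the cache of any leg,
	 * so a FLUSH has to reach all of them, not just the target leg.
	 */
	if (bio->flags & SIM_REQ_FLUSH)
		for (int i = 0; i < SIM_NLEGS; i++)
			s->legs[i].flushes++;

	if (bio->len == 0)
		return;		/* empty flush, no data to remap */

	/* Remap the data and carry the FUA bit down to the target leg. */
	leg = &s->legs[(bio->sector / s->chunk) % SIM_NLEGS];
	leg->writes++;
	if (bio->flags & SIM_REQ_FUA)
		leg->fua_writes++;
}
```

The point of the sketch is the asymmetry: FUA follows the data to exactly
one leg, while FLUSH fans out to every leg regardless of where the data
lands.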


Implementation details for request_fn based block drivers
---------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required: the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer. For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
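
The decomposition described in this section can be summarised as a small
planning function. This is a sketch of the rules as stated in the text, not
the blk-flush.c implementation: sim_plan(), struct sim_step and the SIM_*
values are hypothetical, and "caps" stands for whatever the driver passed to
blk_queue_flush().

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are made up for this model; they are not the kernel's. */
#define SIM_REQ_FLUSH (1u << 0)
#define SIM_REQ_FUA   (1u << 1)

enum sim_step_kind { SIM_STEP_FLUSH, SIM_STEP_DATA };

struct sim_step {
	enum sim_step_kind kind;
	unsigned flags;		/* flags the driver sees on this step */
};

/*
 * Fill out[] with the steps the driver will see for one request and
 * return how many there are (0..3).
 */
int sim_plan(unsigned caps, unsigned req_flags, int has_payload,
	     struct sim_step out[3])
{
	int n = 0;

	assert(out != NULL);

	/* No volatile cache: strip both bits, drop empty flushes. */
	if (!(caps & SIM_REQ_FLUSH)) {
		if (has_payload)
			out[n++] = (struct sim_step){ SIM_STEP_DATA, 0 };
		return n;
	}

	/* REQ_FLUSH becomes an empty flush ahead of the data. */
	if (req_flags & SIM_REQ_FLUSH)
		out[n++] = (struct sim_step){ SIM_STEP_FLUSH, 0 };

	if (has_payload) {
		/* FUA is passed through only if the queue advertised it. */
		unsigned fua = req_flags & caps & SIM_REQ_FUA;

		out[n++] = (struct sim_step){ SIM_STEP_DATA, fua };

		/* Otherwise FUA is emulated by a flush after the write. */
		if ((req_flags & SIM_REQ_FUA) && !fua)
			out[n++] = (struct sim_step){ SIM_STEP_FLUSH, 0 };
	}
	return n;
}
```

With caps set to SIM_REQ_FLUSH only, a write carrying both flags turns into
three steps (flush, plain write, flush); with SIM_REQ_FLUSH | SIM_REQ_FUA it
is two steps (flush, FUA write); with no caps at all it is a single plain
write.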
Christoph Hellwig [ Fri, 20 August 2010 15:22 ] [ ID #2046356 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> FYI: here's a little writeup to document the new cache flushing scheme,
> intended to replace Documentation/block/barriers.txt. Any good
> suggestion for a filename in the kernel tree?
>
> ---

I was thinking that we might be better off using the "durable writes" term more
since it is well documented (at least in the database world, where it is the "D"
in ACID properties). Maybe "durable_writes_support.txt" ?


>
> Explicit volatile write cache control
> =====================================
>
> Introduction
> ------------
>
> Many storage devices, especially in the consumer market, come with volatile
> write back caches. That means the devices signal I/O completion to the
> operating system before data actually has hit the physical medium. This
> behavior obviously speeds up various workloads, but it means the operating
> system needs to force data out to the physical medium when it performs
> a data integrity operation like fsync, sync or an unmount.
>
> The Linux block layer provides two simple mechanisms that let filesystems
> control the caching behavior of the storage device. These mechanisms are
> a forced cache flush, and the Forced Unit Access (FUA) flag for requests.
>

Should we mention that users can also disable the write cache on the target device?

It might also be worth mentioning that storage needs to be properly configured -
i.e., an internal hardware RAID card with battery backing can expose
itself as a writethrough cache *only if* it actually has control over all of the
backend disks and can flush/disable their write caches.

Maybe that is too much detail, but I know that people have lost data with some
of these setups.

The rest of the write up below sounds good, thanks for pulling this together!

Ric


>
> Explicit cache flushes
> ----------------------
>
> The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from the
> filesystem and will make sure the volatile cache of the storage device
> has been flushed before the actual I/O operation is started. This explicitly
> guarantees that write requests that have completed before the bio was submitted
> are actually on the physical medium before this request has started.
> In addition the REQ_FLUSH flag can be set on an otherwise empty bio
> structure, which causes only an explicit cache flush without any dependent
> I/O. It is recommended to use the blkdev_issue_flush() helper for a pure
> cache flush.
>
>
> Forced Unit Access
> ------------------
>
> The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
> filesystem and will make sure that I/O completion for this request is not
> signaled before the data has made it to non-volatile storage on the
> physical medium.
>
>
> Implementation details for filesystems
> --------------------------------------
>
> Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
> worry if the underlying devices need any explicit cache flushing and how
> the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
> may both be set on a single bio.
>
>
> Implementation details for make_request_fn based block drivers
> --------------------------------------------------------------
>
> These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
> directly below the submit_bio interface. For remapping drivers the REQ_FUA
> bit needs to be propagated to underlying devices, and a global flush needs
> to be implemented for bios with the REQ_FLUSH bit set. For real device
> drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
> on non-empty bios can simply be ignored, and REQ_FLUSH requests without
> data can be completed successfully without doing any work. Drivers for
> devices with volatile caches need to implement the support for these
> flags themselves without any help from the block layer.
>
>
> Implementation details for request_fn based block drivers
> ---------------------------------------------------------
>
> For devices that do not support volatile write caches there is no driver
> support required: the block layer completes empty REQ_FLUSH requests before
> entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
> requests that have a payload. For devices with volatile write caches the
> driver needs to tell the block layer that it supports flushing caches by
> doing:
>
> blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
>
> and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
> REQ_FLUSH requests with a payload are automatically turned into a sequence
> of empty REQ_FLUSH and the actual write by the block layer. For devices
> that also support the FUA bit the block layer needs to be told to pass
> through that bit using:
>
> blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
>
> and handle write requests that have the REQ_FUA bit set properly in its
> prep_fn/request_fn. If the FUA bit is not natively supported the block
> layer turns it into an empty REQ_FLUSH request after the actual write.

Ric Wheeler [ Fri, 20 August 2010 17:18 ] [ ID #2046357 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> >FYI: here's a little writeup to document the new cache flushing scheme,
> >intended to replace Documentation/block/barriers.txt. Any good
> >suggestion for a filename in the kernel tree?
> >
> >---
>
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties). Maybe
> "durable_writes_support.txt" ?

sata_lies.txt?

Ok, maybe writeback_cache.txt?

-chris
Chris Mason [ Fri, 20 August 2010 18:00 ] [ ID #2046358 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/20/2010 12:00 PM, Chris Mason wrote:
> On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
>> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>>> FYI: here's a little writeup to document the new cache flushing scheme,
>>> intended to replace Documentation/block/barriers.txt. Any good
>>> suggestion for a filename in the kernel tree?
>>>
>>> ---
>>
>> I was thinking that we might be better off using the "durable
>> writes" term more since it is well documented (at least in the
>> database world, where it is the "D" in ACID properties). Maybe
>> "durable_writes_support.txt" ?
>
> sata_lies.txt?
>
> Ok, maybe writeback_cache.txt?
>
> -chris

writeback_cache.txt is certainly the least confusing :)

ric

Ric Wheeler [ Fri, 20 August 2010 18:02 ] [ ID #2046359 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> for REQ_FLUSH down to the lowest should be fine.
> (I didn't know that FLUSH failure implies data loss possibility.)

At least on ATA, FLUSH failure implies that data is already lost, so
the error can't be ignored or retried.

> But the patch is not enough, you have to change target drivers, too.
> E.g. As for multipath, you need to change
> drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
> like the REQ_DISCARD support included in 2.6.36-rc1.

I'll take a look, but is there an easy way to test mpath other than
having fancy hardware?

> By the way, if these patch-set with the change above are included,
> even one path failure for REQ_FLUSH on multipath configuration will
> be reported to upper layer as error, although it's retried using
> other paths currently.
> Then, if an upper layer won't take correct recovery action for the error,
> it would be seen as a regression for users. (e.g. Frequent EXT3-error
> resulting in read-only mount on multipath configuration.)
>
> Although I think the explicit error is fine rather than implicit data
> corruption, please check upper layers carefully so that users won't see
> such errors as much as possible.

Argh... then it will have to discern why FLUSH failed. It can retry
for transport errors but if it got aborted by the device it should
report upwards. Maybe just turn off barrier support in mpath for now?
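
To make the distinction concrete, the policy could be sketched roughly as
below. This is only a userspace model of the decision, not dm-mpath code:
the enum names and sim_flush_end_io() are made up, and the "paths_left"
check is an assumption about when a retry is still possible.

```c
#include <assert.h>

/* Why a FLUSH completed the way it did (hypothetical classification). */
enum sim_flush_result {
	SIM_FLUSH_OK,
	SIM_FLUSH_TRANSPORT_ERROR,	/* path failed, device never judged it */
	SIM_FLUSH_DEVICE_ABORT		/* device itself failed the flush */
};

enum sim_action {
	SIM_COMPLETE,
	SIM_RETRY_OTHER_PATH,
	SIM_REPORT_ERROR
};

/* Decide what a multipath-like layer does when a FLUSH completes. */
enum sim_action sim_flush_end_io(enum sim_flush_result res, int paths_left)
{
	assert(paths_left >= 0);

	switch (res) {
	case SIM_FLUSH_OK:
		return SIM_COMPLETE;
	case SIM_FLUSH_TRANSPORT_ERROR:
		/* The flush may never have reached the device: retry. */
		return paths_left > 0 ? SIM_RETRY_OTHER_PATH
				      : SIM_REPORT_ERROR;
	case SIM_FLUSH_DEVICE_ABORT:
		/* Data may already be lost; a retry would only hide it. */
		return SIM_REPORT_ERROR;
	}
	return SIM_REPORT_ERROR;
}
```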

Thanks.

--
tejun
Tejun Heo [ Mon, 23 August 2010 14:14 ] [ ID #2046440 ]

Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/20/2010 05:18 PM, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>> FYI: here's a little writeup to document the new cache flushing scheme,
>> intended to replace Documentation/block/barriers.txt. Any good
>> suggestion for a filename in the kernel tree?
>>
>
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties). Maybe
> "durable_writes_support.txt" ?

The term is very foreign to people outside of the enterprise / database
loop. writeback-cache.txt or write-cache-control.txt sounds good
enough to me.

>> The Linux block layer provides two simple mechanisms that let filesystems
>> control the caching behavior of the storage device. These mechanisms are
>> a forced cache flush, and the Forced Unit Access (FUA) flag for requests.
>>
>
> Should we mention that users can also disable the write cache on the
> target device?
>
> It might also be worth mentioning that storage needs to be properly
> configured - i.e., an internal hardware RAID card with battery
> backing can expose itself as a writethrough cache *only if* it
> actually has control over all of the backend disks and can
> flush/disable their write caches.

It might be useful to give several example setups with different
cache configurations. I don't have much experience with battery
backed arrays but aren't they supposed to report a write through
cache automatically?

Thanks.

--
tejun
Tejun Heo [ Mon, 23 August 2010 14:30 ] [ ID #2046441 ]